Abstract

LR parsing is used for a wide range of applications, including compiler construction,
automatic code generation, language-specific editors and natural language processing.
Currently, however, solutions have not been developed for practical multiple-lookahead
parsing, fully automatic error recovery, and space- and time-efficient LR parsing across
the wide range of applications.

A practical framework for LR(k) parsing is introduced. An efficient algorithm
incrementally constructs an LALR(k) parser with varying-length lookahead strings, whose
symbols are consulted during parsing only when necessary. Currently, effective LR error
recovery systems require some user intervention. An effective and fully automated
syntactic error recovery method for LR(k) parsers is presented. A generally effective
method for compressing LR(k) parsing tables is also presented.

These innovations have been incorporated into a parser generator system which
automatically produces a production-quality parser with error diagnostics and recovery.

Chapter  1

LALR(k)  Parsing


1.1     Introduction 

In 1965, Knuth [1] introduced LR(k) parsing, a bottom-up syntax analysis technique that
can be used to recognize the largest class of deterministic context-free languages. (The
"L" stands for left-to-right scanning of the input, the "R" is for constructing a rightmost
derivation in reverse, and the k is for the number of input symbols of lookahead that are
used in making parsing decisions.) Over the years, this parsing method has attracted
much attention because, in addition to its ability to recognize a large class of languages,
the resulting parsers offer the following advantages:

•  they can be constructed automatically from a context-free grammar definition;

•  they are time-efficient since they can accept or reject an input in a single left-to-right
scan of it with no backup;

•  they  can  detect  an  error  at  the  earliest  possible  point. 

A context-free grammar is said to be LR(k) if an LR(k) parser can be successfully
constructed from it. A language is said to be LR(k) if it can be defined by an LR(k) grammar.
In their canonical form, LR(k) parsers (when k > 0) usually require too much space to
be of practical use. (The relationship between an arbitrary context-free grammar and the
size of its canonical LR(k) parser has never been precisely demonstrated, but for a typical
programming language grammar, when k = 0 the parser usually contains several hundred
states; when k = 1, parsers with several thousand states are common.) As a result, two
variants of LR(k) parsers which were invented by DeRemer have gained popularity over
the years. They are known as LookAhead LR(k) (LALR(k)), introduced in 1969 and
described in [3], and simple LR(k) (SLR(k)), introduced in 1971 and described in [4]. These
variants of LR(k) parsers are relatively space-efficient because their underlying automaton
is the LR(0) machine, regardless of the value of k. The set of languages that is SLR(k) is
a proper subset of the set of LALR(k) languages which, in turn, is a proper subset of the
set of LR(k) languages. However, in practice, LALR(k) grammars are used because they
are sufficiently powerful to accommodate most programming language constructs.


By keeping the number of states at a minimum, the SLR(k) and LALR(k) variants
help reduce the space requirement of an LR(k) parser while retaining the speed advantage
of the latter.

The symbols of a context-free grammar are divided into two classes: terminals (input
symbols) and nonterminals (phrase symbols). An LR(k) parser (or variant) for a
context-free grammar is a deterministic pushdown automaton that can be represented by two
matrices: ACTION, which represents the mapping of a parsing action function, and GOTO,
which represents the mapping of a goto function. These matrices will be referred to
generically as parsing tables.

The parsing action function takes as arguments a state and a string of k terminals
(called the lookahead string) and produces one of four values: shift, reduce i, error or
accept. The goto function is the transition function of the automaton. It takes as arguments
a state and a grammar symbol (terminal or nonterminal) and produces either a new state
that the parser should enter or error. Thus, the rows of the parsing tables are indexable
by a state of the automaton, each column of ACTION is indexable by a string of terminals
of length k and each column of GOTO is indexable by a distinct grammar symbol. Each
entry in a parsing table is either a useful entry that represents a valid move to be taken
by the automaton (for the corresponding pair of indices) or an error entry.
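
The way the two tables drive the automaton can be sketched as follows. This is a minimal illustration for k = 1, not the parser produced by the generator described in this thesis: the toy grammar (rule 1: E -> E + n, rule 2: E -> n), the end marker "$", the hand-built tables and the names RULES and parse are all assumptions made for the example. Missing dictionary keys stand in for error entries.

```python
# Hypothetical tables for the toy grammar
#   rule 1: E -> E + n      rule 2: E -> n
# with end marker "$". A missing key plays the role of an error entry.
RULES = {1: ("E", 3), 2: ("E", 1)}  # rule number -> (left side, right-side length)

ACTION = {
    (0, "n"): "shift",
    (1, "+"): "shift", (1, "$"): "accept",
    (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "n"): "shift",
    (4, "+"): ("reduce", 1), (4, "$"): ("reduce", 1),
}
# GOTO is indexed by a state and any grammar symbol (terminal or nonterminal).
GOTO = {(0, "E"): 1, (0, "n"): 2, (1, "+"): 3, (3, "n"): 4}


def parse(tokens):
    """Return the list of rule numbers applied, or None on a syntax error."""
    stack = [0]                               # stack of automaton states
    tokens = list(tokens) + ["$"]
    pos = 0
    applied = []
    while True:
        state, look = stack[-1], tokens[pos]
        action = ACTION.get((state, look))    # None stands for an error entry
        if action is None:
            return None
        if action == "accept":
            return applied
        if action == "shift":
            stack.append(GOTO[(state, look)])
            pos += 1
        else:
            _, rule = action
            lhs, length = RULES[rule]
            del stack[len(stack) - length:]   # pop one state per right-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
            applied.append(rule)
```

Calling parse(["n", "+", "n"]) applies rule 2 and then rule 1, while parse(["n", "n"]) hits an error entry in state 2. Storing only the useful entries in dictionaries is one simple way to exploit the sparsity discussed in the next section; it is not the compression method of this thesis.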

1.2      The  Problems 

As can be observed from the definition of the parsing tables, for a given grammar, the
number of states (rows) in its SLR(k) or LALR(k) automaton and the number of columns
in its GOTO matrix remain fixed for any value of k; but the number of columns in its
ACTION matrix is exponential with respect to k. However, if the error entries are kept
as empty slots these matrices are very sparse. Typically, less than 2% of the entries in the
parsing tables of an LALR(1) parser are useful.

One of the most important issues in LR parsing is to find suitable data structures for
these parsing tables whose space requirement is, at worst, proportional to the number of
useful entries in the tables but whose time-efficiency is comparable to that of the matrix
representation. Another important issue is that of providing an efficient error recovery
system for this parsing framework. In particular, the LR(k) variants lose the inherent
ability of their canonical counterpart to detect an error at the earliest possible point.

The attempt to make LR(k) parsers more useful, that is, faster, smaller and more
automated, raises the following important questions:

•  How  can  such  a  parser  be  constructed  efficiently? 

•  How can error recovery be done in this framework? More specifically, can an
automatic or semi-automatic error recovery system be constructed that works with all
LR(k) parsers?

•  What is the relationship between the specific type of parsing and the speed and size
of the resulting automaton? (For example, the use of extra lookahead symbols can
potentially affect both size and speed.)

•  How can the parse tables be represented compactly without sacrificing speed?


Several  papers  have  been  published  that  address  these  problems.  In  the  following 
subsections,  the  results  of  some  of  the  seminal  work  in  these  areas  are  briefly  described 
followed  by  a  description  of  the  major  innovations  of  this  thesis. 

1.2.1     LALR(k)  Parser  Construction

LALR(k) parsers are almost always used because they are more space-efficient than both
the LR(k) and SLR(k) parsers and, in addition, they are more powerful than the SLR(k)
parsers. Most commercially available parser generators only deal with the case of k = 1.

An LALR(k) parser can be constructed by first building an LR(k) parser and then
merging some states. However, this approach is impractical since it is usually difficult
to construct the LR(k) parser because of its large space requirement. Instead, a
two-step approach is usually taken. In the first step, an LR(0) automaton is constructed (all
LALR(k) parsers are based on this automaton); and in the second step the resulting tables
are augmented with the necessary lookahead actions.

Informally, an LR(0) item is a context-free grammar rule with a marker that separates
its right-hand side into a prefix that has been processed and a suffix that has yet to be
processed. An item is called final when the marker indicates that its prefix consists of
the whole right-hand side (and its suffix is empty). Each state of an LR(0) automaton
corresponds to a set of items.

The LALR(k) lookahead set for a final item in a state of an LR(0) automaton is a set
of terminal strings of length k. During parsing, if the next k symbols in the input match
one of the lookahead strings, the parser should perform a reduction by the specific rule
from which the relevant item is derived. Hence, in the second step of the construction of
an LALR(k) parser, for each state and each lookahead string element of a lookahead set
computed in that state, the corresponding ACTION matrix entry is updated to indicate
the relevant reduce action. Thus, the main issue in constructing an LALR(k) parser is:
how can the lookahead sets be computed efficiently with respect to both space and time?

DeRemer presented an algorithm for constructing SLR(k) parsers [4]. However, an
algorithm to compute lookahead sets for an LALR parser directly on the LR(0) automaton
(even for the case k = 1) remained elusive until 1971, when Lalonde [6] presented
his algorithm. Other algorithms for computing LALR(1) lookahead sets on the LR(0)
automaton were published throughout the 1970s, including the method used in Yacc [12],
which is also described in detail in [33], and the lane-tracing method of Pager [13]. These
algorithms were not very efficient.

Much progress was made in the area of computing LALR(1) lookahead sets in the
1980s. The algorithm of DeRemer and Penello, published in 1982 [26], still remains
the best, especially when it is implemented with some improvements suggested in [41].
Bermudez and Logothethis, in 1989 [39], proposed a theoretically interesting approach
that reduces the problem of computing LAFOLLOW sets, which are used to compute
LALR lookahead sets, to the problem of computing regular FOLLOW sets, which are used
to construct SLR lookahead sets. The time complexity of this algorithm is potentially
the same as that of the DeRemer and Penello algorithm, as will be shown later, but the
space requirement is greater. Other algorithms have also been published, most notably:
Kristensen and Madsen, 1981 [24]; Park, Choe and Chang, 1985 [32]; Ives, 1986 [36]. All
of these algorithms, however, are less efficient than the DeRemer and Penello approach.

Only Kristensen and Madsen generalized their algorithm to compute lookahead sets
for k > 1. However, their generalized algorithm is only of theoretical interest, in that
it computes the complete lookahead sets required for each ambiguous state. As can be
observed from the definition of the ACTION matrix, the problem of computing the full
lookahead sets for an LALR(k) parser is inherently intractable in that the size of the
solution itself may be exponential with respect to k.
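
A short worked count may make the growth concrete. Since each ACTION column is indexed by a string of k terminals, a terminal alphabet T yields |T|^k columns. The figure |T| = 50 below is an assumption chosen only for illustration, not a measurement from any grammar discussed here.

```latex
\[
  \#\text{columns of ACTION} = |T|^{k}, \qquad
  |T| = 50:\quad
  k = 1 \Rightarrow 50, \quad
  k = 2 \Rightarrow 2\,500, \quad
  k = 3 \Rightarrow 125\,000 .
\]
```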

1.2.2  Error  recovery 

Error recovery is traditionally divided into simple recovery [21] [37], phrase-level (or
secondary) recovery [9] [21] [27] [37], and scope recovery [37].

In simple recovery, an attempt is made to repair an erroneous input by using primitive
editing operations on the error symbol. That is, a symbol may be inserted before it, it
may be replaced by another symbol, or it may be deleted.
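
The three primitive edits can be enumerated mechanically. The sketch below only generates the candidate repairs at a given error position; how a recovery system ranks or validates them is not described in this section, so nothing of the sort is attempted. The function name simple_repairs and its parameters are hypothetical.

```python
def simple_repairs(tokens, error_pos, terminals):
    """List the token sequences obtained by one primitive edit at error_pos.

    Each candidate is a pair (kind, repaired_tokens), where kind is
    "insert", "replace" or "delete".
    """
    before, after = tokens[:error_pos], tokens[error_pos + 1:]
    error_symbol = tokens[error_pos]
    candidates = []
    for t in terminals:                       # insert t before the error symbol
        candidates.append(("insert", before + [t, error_symbol] + after))
    for t in terminals:                       # replace the error symbol by t
        if t != error_symbol:
            candidates.append(("replace", before + [t] + after))
    candidates.append(("delete", before + after))   # delete the error symbol
    return candidates
```

For the erroneous input n n with the error detected at the second token and terminals n and +, the candidates include the insertion that yields n + n.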

In phrase-level recovery, a sequence of zero or more tokens in the vicinity of the error
symbol is discarded from the input or replaced by a nonterminal. The error productions
approach of Yacc is a form of secondary recovery where the nonterminal candidates to be
chosen for this kind of repair are identified by productions whose right-hand sides include a
special terminal symbol called the error symbol. Sippu and Soisalon-Soininen [27] presented
a more sophisticated method for secondary recovery which does not require the use of
error productions but is somewhat expensive in that it requires that some information be
computed at run-time.

Scope recovery was introduced by Burke and Fisher. The idea is to insert a sequence of
closing syntactic fragments into the text, where appropriate, to complete the specification
of certain blocks or block-like structures. This recovery approach is very effective when
used in conjunction with primary and secondary recovery as suggested in [37]. However,
each relevant closing fragment had to be specified explicitly as a sequence of terminal
symbols. Therefore, their method required that a user be familiar with the language in
question.

The error recovery method of Burke and Fisher is the most practical and effective
method to date. However, it is based on a deferred parsing technique which requires a
double parsing of the input even for correct programs. In addition to the introduction
of scope recovery, Burke and Fisher also made some improvements in primary recovery
by considering merging of two adjacent tokens and misspelling of keywords. Other error
recovery methods (e.g. [17]) have been published, but they are mostly of theoretical interest
and are not used in practice.

1.2.3  Parsing  Tables 

The issue of LR parsing table compression has been widely studied, but up to now, no
general method has been produced that is well-suited across the range of different
applications. Table compression is still treated in the literature as a time versus space problem.
Depending on the application, techniques from sparse matrix representation with
sequential searching, hashing, and other more time-efficient but space-consuming direct access
methods have been proposed.


The table compression technique used in Yacc [9] consists of a combination of direct
access techniques for transitions and sequential search techniques for reduce actions. For
small grammars, this is an acceptable approach. However, this approach does not always
perform well on large grammars. Tarjan and Yao [23] published an analysis of the direct
access method of Ziegler, and formulated the precise conditions under which that
compression technique performs well. In 1984, Dencker, Durre and Heuft [30] presented a direct
access technique based on graph coloring which does very well in minimizing space.
Unfortunately, their approach requires referencing a packed boolean matrix to test the validity
of each action. In practice, this test renders their method slower than the Yacc method.

1.3      Contributions  of  the  Thesis 

In this thesis, several contributions are made in each of the areas mentioned above. These
results have been integrated in a parser generator system that automatically produces
efficient LALR(k) parsers with error recovery from a context-free grammar definition.
These innovations are summarized as follows:

•  A new framework for LALR(k) parsers. As pointed out before, the construction of a
traditional LALR(k) parser is impractical since the size of the lookahead sets required
for such a parser can be exponential. The approach taken in this method can be best
described as generating an LALR(k) parser with variable-length lookahead strings. A
lookahead set for a final item in a given state of an LR(k) parser consists of the set
of strings of length k that may appear in the input when the parser enters that state.
In an LALR(k) parser with variable-length lookahead strings, each lookahead set is
replaced by a minimum subset of prefixes of its string elements that is sufficient to
render the parser deterministic. This new framework is discussed in chapter 3.

•  A practical algorithm for constructing variable LALR(k) parsers. This method not
only computes the minimum amount of lookahead information required but it does
it in an incremental fashion. Thus, the space needed to construct these sets is kept
to a minimum. This algorithm is presented in section 3.3.

•  A fully automatic error recovery method which is more practical and efficient than
other known methods. This language- and machine-independent method is applicable
to all forms of LR(k) parsing but it is especially effective in the context of a parser
generated by the above method. Error recovery is the subject of chapter 4.

•  A practical and effective method for compressing LR(k) parsing tables. This
compression method is also applicable to all forms of LR(k) parsers, but it is particularly
effective in this framework. Table compression is covered in chapter 5.


Chapter  2 

The  Parser  Generator 

2.1      Basic  Concepts  and  Terminology 

A context-free grammar (CFG) is a quadruple $(N, T, P, S)$, where $N$ is a finite set of
nonterminal symbols, $T$ is a finite set of terminal symbols distinct from $N$, $S$ is a
distinguished symbol of $N$ called the start symbol, and $P$ is a finite set of productions,
each of the form $A \to \omega$, where $A \in N$ and $\omega \in V^*$. Given a grammar $G$,
$V$ (the vocabulary) stands for $N \cup T$.

Lower-case Greek letters such as $\alpha$, $\beta$ and $\gamma$ are used to denote strings in $V^*$.
Lower-case Roman letters at the beginning of the alphabet ($a$, $b$, $c$) and $t$ are used to denote
symbols in $T$ while those near the end of the alphabet ($x$, $y$, $z$) denote strings in $T^*$.
Upper-case letters near the beginning of the alphabet ($A$, $B$, $C$) denote nonterminals in $N$ while
those near the end ($X$, $Y$, $Z$) denote symbols in $V$. The empty symbol is denoted $\epsilon$ and the
empty string is denoted $\varepsilon$. The end-of-file token is denoted by $\bot$. The length of a
string $\gamma$ is denoted $|\gamma|$.

The following SETL2 [42] notation will also be used. The symbol $\Omega$ denotes the special
"undefined value" constant. A finite ordered sequence of arbitrary elements, called a tuple,
will be denoted by listing the elements in the correct order, within the brackets "[" and
"]". If $T$ is a tuple, $T(i)$ is the $i$th element of $T$ and $T(m..n)$ is the tuple consisting of
the elements $T(m), T(m+1), \ldots, T(n)$, if $m \le n$, and the empty tuple, [], otherwise. If
$T_1$ and $T_2$ are tuples, then $T_1 + T_2$ is the tuple obtained by appending the sequence of
elements in $T_2$ at the end of the sequence of elements of $T_1$. A single-valued map from a
finite set $A$ (the domain) to a finite set $B$ (the range) will be represented as a set of ordered
pairs $[x, y]$, where $x \in A$, $y \in B$ and each element of $A$ is mapped to at most one element
of $B$. Given a map $M$, and an element $x$ in its domain, $M(x)$ represents the element $y$ in
the range of $M$ that is paired with $x$ ($y$ is called the image of $x$). If $X$ is a tuple, set or
map, its length or cardinality is denoted $\#X$.

From now on it is assumed that a given grammar $G$ has been augmented with a new
starting rule $S' \to S\bot^k$ and $G$ contains no useless nonterminals. A nonterminal $A$ is said
to be useless if it does not generate any string of terminals; i.e., $A \not\Rightarrow^* w$ for any $w \in T^*$.

For a given context-free grammar,

$$\mathrm{FIRST}_k(\alpha) = \{x \mid (\alpha \Rightarrow^*_{lm} x\beta \text{ and } |x| = k) \text{ or } (\alpha \Rightarrow^* x \text{ and } |x| < k)\}.$$


That is, $\mathrm{FIRST}_k(\alpha)$ consists of all terminal prefixes of length $k$ (or less if $\alpha$ derives
a terminal string of length less than $k$) of the terminal strings that can be derived from
$\alpha$. Closely related to the $\mathrm{FIRST}_k$ function is the $\varepsilon$-free first function, $\mathrm{EFF}_k(\alpha)$, which is
defined as all the elements of $\mathrm{FIRST}_k(\alpha)$ whose derivation does not involve replacing a
leading nonterminal by $\varepsilon$. More formally,

$$\mathrm{EFF}_k(\alpha) = \{w \mid \alpha \Rightarrow^*_{rm} \beta \Rightarrow_{rm} wx,\ \beta \neq Awx\ \ \forall A \in N \text{ and } \{w\} = \mathrm{FIRST}_k(wx)\}$$


If $x$ and $y$ denote arbitrary strings then $x.y$ is the string obtained by concatenating the
string denoted by $y$ to the string denoted by $x$. Let $M$ and $N$ be two sets of strings; the
concatenation operation is extended to sets of strings as follows:

$$M.N = \{x.y \mid x \in M,\ y \in N\}$$

If $M, N \subseteq T^*$ then

$$M \oplus_k N = \bigcup\{\mathrm{FIRST}_k(w) \mid w \in M.N\}$$
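
The definitions of $\mathrm{FIRST}_k$ and $\oplus_k$ translate into a small fixed-point computation. The following is a minimal sketch, not the algorithm used by the parser generator: terminal strings are modelled as Python tuples, a grammar is assumed to be a dictionary mapping each nonterminal to a list of right-hand sides, and the names kconcat, first_k_table and first_k_of are hypothetical. It relies on the assumption stated above that the grammar has no useless nonterminals.

```python
def kconcat(m, n, k):
    """k-truncated concatenation of two sets of terminal strings (tuples)."""
    return {(x + y)[:k] for x in m for y in n}


def first_k_of(symbols, first, k):
    """FIRST_k of a string of grammar symbols, given the table for nonterminals."""
    result = {()}                                  # start from the empty string
    for sym in symbols:
        # A nonterminal contributes its current FIRST_k set, a terminal itself.
        step = first[sym] if sym in first else {(sym,)}
        result = kconcat(result, step, k)
    return result


def first_k_table(grammar, k):
    """Compute FIRST_k for every nonterminal by iterating to a fixed point."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_k_of(rhs, first, k)
                if not new <= first[a]:            # something not yet recorded
                    first[a] |= new
                    changed = True
    return first
```

With the toy grammar E -> E + n | n and k = 2, the table assigns E the strings n and n +, since every longer sentence begins with n +.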

2.1.1      LR(k)  parsers

An LR(k) item is a quadruple $(A, \alpha, \beta, u)$, written $[A \to \alpha \cdot \beta, u]$, where $A \to \alpha\beta \in P$
and $u \in T^*$ is a lookahead string. $A$ is called the left side, $\alpha$ is called the prefix, $\beta$ is called
the suffix. The first symbol in $\beta$, immediately following the dot, is called the dot symbol.
When $\beta = \varepsilon$, the item is called a final item and the dot symbol is considered to be $\epsilon$.

Let $K$ be a set of LR(k) items. The closure of $K$, denoted $\mathrm{CLOSURE}(K)$, is defined
as the smallest set satisfying the equation:

$$\mathrm{CLOSURE}(K) = K \cup \{[B \to \cdot\gamma, v] \mid v \in \mathrm{FIRST}_k(\beta u),\ [A \to \alpha \cdot B\beta, u] \in \mathrm{CLOSURE}(K),\ B \to \gamma \in P\}$$

Let $p$ be a closure set. The kernel set of $p$, denoted $\mathrm{KERNEL}(p)$, is the smallest subset
of the LR(k) items in $p$ such that:

$$p = \mathrm{CLOSURE}(\mathrm{KERNEL}(p))$$

Given a set of items, $p$, for each dot symbol $X$ that appears in an item of $p$, a goto
function, $\mathrm{GOTO}_k$, is defined on the pair $(p, X)$ as follows:

$$\mathrm{GOTO}_k(p, X) = \mathrm{CLOSURE}(\{[A \to \alpha X \cdot \beta, u] \mid [A \to \alpha \cdot X\beta, u] \in p\})$$

For a given grammar $G = (N, T, P, S)$, the canonical set of LR(k) items for $G$, denoted
$I_k^G$, can be constructed with the following procedure given a closure function (to compute
$\mathrm{CLOSURE}(K)$ for some set of items $K$) and $\mathrm{GOTO}_k$.

1.  Initialize $I_k^G = \emptyset$.

2.  Start with a kernel set consisting solely of the initial item $[S' \to \cdot S\bot^k]$; compute its
closure set and add that closure set to $I_k^G$.

3.  Choose a closure set $p$ from $I_k^G$. Compute its set of dot symbols and apply the $\mathrm{GOTO}_k$
function on $p$ and each of its dot symbols. If any new closure sets not yet in $I_k^G$ are
thus obtained they are added to $I_k^G$.

4.  Repeat the preceding step until no more new closure sets can be added to $I_k^G$.

This  algorithm  must  clearly  terminate,  since  the  set  of  items  and  the  set  of  symbols 
are  finite. 
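
The procedure above can be sketched for the special case k = 0, where the lookahead component of every item is empty and can be dropped. This is an illustrative sketch only: items are modelled as triples (left side, right side, dot position), the toy grammar and its end marker "$" are assumptions, and the names closure, goto and canonical_collection are hypothetical.

```python
GRAMMAR = {                         # hypothetical toy grammar, already augmented
    "S'": [("E", "$")],
    "E": [("E", "+", "n"), ("n",)],
}


def closure(kernel):
    """Smallest item set containing kernel and closed under nonterminal expansion."""
    items = set(kernel)
    work = list(kernel)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:     # dot symbol is a nonterminal
            for alt in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], alt, 0)              # initial item for that rule
                if item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)


def goto(state, symbol):
    """Advance the dot over symbol in every item that allows it, then close."""
    moved = {(l, r, d + 1) for (l, r, d) in state if d < len(r) and r[d] == symbol}
    return closure(moved)


def canonical_collection(start="S'"):
    """Steps 1 to 4: grow the set of closure sets until nothing new appears."""
    initial = closure({(start, GRAMMAR[start][0], 0)})
    states, work = {initial}, [initial]
    transitions = {}
    while work:
        p = work.pop()
        for x in {r[d] for (_, r, d) in p if d < len(r)}:   # dot symbols of p
            q = goto(p, x)
            transitions[(p, x)] = q
            if q not in states:
                states.add(q)
                work.append(q)
    return initial, states, transitions
```

For this toy grammar the construction yields six states and five transitions. A work list replaces the "choose and repeat" wording of steps 3 and 4, which is a design choice of the sketch rather than something prescribed by the text.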

Definition 2.1.1 Let $G$ be a context-free grammar. The LR(k) machine for $G$ is a triple
$LRM_k^G = (M_k^G, IS_k^G, \mathrm{GOTO}_k^G)$, where $M_k^G$ is a set of LR(k) states, one for each set of
items in $I_k^G$, $IS_k^G$ is the initial state corresponding to the closure set of the initial item, and
$\mathrm{GOTO}_k^G$ is the GOTO function defined on $M_k^G \times V \to M_k^G$.

Observe that a state $p$ in $M_k$ is characterised by its kernel set since the complete set
of items making up that state can be reproduced given the kernel set and the closure
function. For convenience, no distinction will be made, from now on, between a state and
its corresponding set of items. Also, for a given grammar $G$, the superscript $G$ will be
omitted whenever this omission causes no confusion. An item in $p$ that is in $\mathrm{KERNEL}(p)$
is called a kernel item of $p$. An item in $p$ that is not in $\mathrm{KERNEL}(p)$ is called a closure
item.

It is also convenient to generalize the $\mathrm{GOTO}_k$ function for arbitrary strings as follows:

$$\mathrm{GOTO}_k(p, \varepsilon) = p$$

$$\mathrm{GOTO}_k(p, X\alpha) = \mathrm{GOTO}_k(\mathrm{GOTO}_k(p, X), \alpha)$$

Let PRED be the inverse of the $\mathrm{GOTO}_k$ function. It is defined on arbitrary strings as
follows:

$$\mathrm{PRED}(p, \alpha) = \{q \mid \mathrm{GOTO}_k(q, \alpha) = p\}$$
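
Both generalizations are straightforward to express over an explicit transition table. The sketch below assumes a small hand-written LR(0) transition table with integer state names; TRANSITIONS, STATES, goto_string and pred are hypothetical names, and undefined transitions are represented by None.

```python
# Hypothetical transition table of a small LR(0) machine (states are integers).
TRANSITIONS = {(0, "E"): 1, (0, "n"): 2, (1, "+"): 3, (3, "n"): 4}
STATES = {0, 1, 2, 3, 4}


def goto_string(p, symbols):
    """GOTO extended to strings; returns None when some transition is undefined."""
    for x in symbols:
        p = TRANSITIONS.get((p, x))
        if p is None:
            return None
    return p                       # for the empty string this is p itself


def pred(p, symbols):
    """All states q from which reading symbols leads to p."""
    return {q for q in STATES if goto_string(q, symbols) == p}
```

Here pred(4, ("E", "+", "n")) is {0}: after a reduction by E -> E + n in state 4, parsing can only resume in state 0. Scanning every state is quadratic in the worst case; an implementation concerned with speed would more likely store reverse transitions.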

2.1.2     LALR(k)  parsers

The notion of LALR(k) parsers is captured by the following definitions and theorems
presented in [24]. In each, let $G$ be a CFG with LR(k) states $M_k$, $k > 0$.

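Definition 2.1.2 Let $p \in M_k$, then

$$\mathrm{LR}_k(p, [A \to \alpha \cdot \beta]) = \{u \mid [A \to \alpha \cdot \beta, u] \in p\}$$

Definition 2.1.3 Let $[A \to \alpha \cdot \beta, u]$ be an LR(k) item and let $p \in M_k$, then

$$\mathrm{CORE}([A \to \alpha \cdot \beta, u]) = [A \to \alpha \cdot \beta]$$

and

$$\mathrm{CORE}(p) = \{\mathrm{CORE}(I) : I \in p\}$$

No distinction is made between the items $[A \to \alpha \cdot \beta, \varepsilon]$ and $[A \to \alpha \cdot \beta]$.

Definition 2.1.4 Let $p \in M_0$, then

$$\mathrm{URCORE}_k(p) = \{q \in M_k \mid \mathrm{CORE}(q) = p\}$$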

URCORE relates an LR(0) state $p$ with a set of LR(k) states with the same core. Note
that since $\mathrm{CORE}(IS_0) = \mathrm{CORE}(IS_k)$, and since $\mathrm{GOTO}_k(p, X)$, for all $k > 0$, depends only
on the core of $p$, each LR(k) state corresponds to an LR(0) state with the same core.
In other words, $\mathrm{URCORE}_k(p) \neq \emptyset$, for all $k > 0$ and $p \in M_0$.

Definition 2.1.5 Let $p \in M_0$, then

$$\mathrm{LALR}_k(p, [A \to \alpha \cdot \beta]) = \bigcup\{\mathrm{LR}_k(q, [A \to \alpha \cdot \beta]) \mid q \in \mathrm{URCORE}_k(p)\}$$


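Definition 2.1.6 A grammar $G$ is said to be LALR(k), $k > 0$, if for all $p \in M_0$ and for
all distinct items $[A \to \alpha \cdot \beta]$ and $[B \to \gamma\cdot]$ in $p$,

$$\mathrm{EFF}_k(\beta) \oplus_k \mathrm{LALR}_k(p, [A \to \alpha \cdot \beta]) \cap \mathrm{LALR}_k(p, [B \to \gamma\cdot]) = \emptyset$$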


An LR(0) machine constructed for a grammar $G$ is in fact a correct parser for $G$; i.e., the
language recognized by $LRM_0$ is exactly the same language described by $G$. However, it
may be nondeterministic due to the presence of one or more inconsistent states. In general,
a state is said to be inconsistent if it allows two different moves for a given lookahead string.
In particular, a state in $M_0$ is inconsistent if it contains two or more items and one of these
items is a final item.
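
The LR(0) inconsistency test stated above is a one-line predicate. The sketch assumes items are modelled as triples (left side, right side, dot position) and states as collections of such items; the names is_final and is_inconsistent_lr0 are hypothetical.

```python
def is_final(item):
    """An item is final when the dot has passed the whole right-hand side."""
    lhs, rhs, dot = item
    return dot == len(rhs)


def is_inconsistent_lr0(state):
    """A state of M_0 is inconsistent if it has two or more items, one of them final."""
    return len(state) >= 2 and any(is_final(item) for item in state)
```

A state holding only E -> n · is consistent, whereas a state holding both S -> E · and E -> E · + n is not, because the parser cannot choose between reducing and shifting without lookahead.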

When a state $p \in M_0$ does not satisfy the condition of definition 2.1.6, it is also said
to be inconsistent (in an LALR(k) sense). The strings that are in the intersection of the
two sets are said to be in conflict and they are called conflict strings. If $\beta \neq \varepsilon$ then the
resulting conflicts are called shift-reduce conflicts; otherwise, they are called reduce-reduce
conflicts.

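Theorem 2.1.1 Let $p \in M_k$, then

$$\mathrm{LR}_k(p, [A \to \alpha \cdot \beta]) = \{w \mid w \in \mathrm{FIRST}_k(y),\ S' \Rightarrow^*_{rm} \gamma A y \Rightarrow \gamma\alpha\beta y,\ \mathrm{GOTO}_k(IS_k, \gamma\alpha) = p\}$$

Theorem 2.1.2 Let $p \in M_0$, then

$$\mathrm{LALR}_k(p, [A \to \alpha \cdot \beta]) = \{w \mid w \in \mathrm{FIRST}_k(y),\ S' \Rightarrow^*_{rm} \gamma A y \Rightarrow \gamma\alpha\beta y,\ \mathrm{GOTO}_0(IS_0, \gamma\alpha) = p\}$$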

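Theorem 2.1.3 Let $p \in M_k$, then

$$\forall q \in \mathrm{PRED}(p, \alpha):\ \mathrm{LR}_k(p, [A \to \alpha \cdot \beta]) = \mathrm{LR}_k(q, [A \to \cdot\alpha\beta])$$

Figure 2.1: Lookahead paths for final item $[A \to \omega\cdot]$ in state $p$

Theorem 2.1.4 Let $p \in M_k$ and let $[A \to \cdot\alpha] \neq [S' \to \cdot S\bot^k]$, then

$$\mathrm{LR}_k(p, [A \to \cdot\alpha]) = \bigcup\{\mathrm{FIRST}_k(\gamma) \oplus_k \mathrm{LR}_k(p, [B \to \beta \cdot A\gamma]) \mid [B \to \beta \cdot A\gamma] \in p\}$$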


Theorems 2.1.1, 2.1.3 and 2.1.4 follow directly from the algorithm for constructing the
canonical collection of LR(k) items. Theorem 2.1.2 may be proved using definition 2.1.5
and theorem 2.1.1.

Note that definition 2.1.5 is a restatement of a point made earlier that an LALR(k)
parser can be obtained by first constructing an LR(k) parser and then merging certain
states of that LR(k) parser. This definition tells us that the relevant LR(k) states that are
merged are those having the same core. However, as was also stated earlier, this approach
is very inefficient since the space required to construct an LR(k) parser, for $k > 0$, is usually
prohibitive. A practical approach to constructing an LALR(k) parser is to first construct
the LR(0) machine and then resolve conflicts in inconsistent states of $M_0$ by computing
lookahead strings for final items in such states. The $\mathrm{LALR}_k$ lookahead of an item $[A \to \omega\cdot]$
in a state $p$ may informally be described as the set of terminal strings of length $k$ that
may appear on input if, during parsing, the reduction $A \to \omega$ can be applied in state $p$.
The objective is to compute $\mathrm{LALR}_k(p, [A \to \omega\cdot])$ using $LRM_0$. Intuitively, the idea is to
simulate all possible steps of the parser following the reduction in order to determine which
input strings can be shifted on next.

Consider the set of states where the parsing may resume after a reduction by a rule
$A \to \omega$ in state $p$; i.e., the set $\mathrm{PRED}(p, \omega)$. Each state $q \in \mathrm{PRED}(p, \omega)$ contains the
initial item $[A \to \cdot\omega]$ which was introduced in $q$ through closure by at least one item of
the form $[B \to \beta \cdot A\gamma]$. After reducing by the rule $A \to \omega$ in $p$, the parser must return
to some such state $q$ where a read transition on $A$ places it in a state $r = \mathrm{GOTO}_0(q, A)$.
(See Figure 2.1.) The lookahead set $\mathrm{LALR}_k(q, [A \to \cdot\omega])$ consists of all terminal strings of
length $k$ that may appear on input after state $r$ is entered. (The validity of this assertion
can be confirmed from theorem 2.1.3 and definition 2.1.5.) This set can be divided into
two subsets as follows:

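$$\mathrm{LALR}_k(q, [A \to \cdot\omega]) =$$

$$\bigcup\{w \in \mathrm{FIRST}_k(\gamma) \mid [B \to \beta \cdot A\gamma] \in q \text{ and } |w| = k\} \qquad (2.1)$$

$$\cup$$

$$\bigcup\{w \oplus_k \mathrm{LALR}_k(q, [B \to \beta \cdot A\gamma]) \mid [B \to \beta \cdot A\gamma] \in q,\ w \in \mathrm{FIRST}_k(\gamma),\ |w| < k\} \qquad (2.2)$$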

The first subset (2.1) consists of strings of length $k$ that are directly derivable from a suffix
$\gamma$ in a kernel item of $r$. If, however, $\gamma \Rightarrow^* w$ and $|w| < k$ then the rule $B \to \beta A\gamma$ may
reduce before a string of length $k$ is read. Hence, the set of strings following $B$ after such
a reduction must be calculated and concatenated to all short strings $w$. The second subset
(2.2) consists of strings that were composed by such concatenations.

Observe that for the case k = 1 the equation above can be greatly simplified. The
function $\mathrm{FIRST}_1$ only yields strings of length 1 and perhaps the empty string. Similarly,
the $\mathrm{LALR}_1$ lookahead sets only contain strings of length 1. Therefore, the $\oplus_1$ operator
can be replaced by a (conditional) union operator ($\cup$). The equation is rewritten as follows:

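\begin{align}
\mathrm{LALR}_1(q, [A \to \cdot\omega]) ={}
  & \bigcup \{\, t \in \mathrm{FIRST}_1(\gamma) \mid [B \to \beta \cdot A\gamma] \in q \,\} - \{\epsilon\} \tag{2.3} \\
  \cup{} & \bigcup \{\, \mathrm{LALR}_1(q, [B \to \beta \cdot A\gamma]) \mid [B \to \beta \cdot A\gamma] \in q,\ \gamma \Rightarrow^* \epsilon \,\} \tag{2.4}
\end{align}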

In [33], the authors refer to the lookahead symbols in the first subset (2.3) of the above
equation as spontaneously generated lookahead and to the symbols in the second subset (2.4)
as propagated lookahead. From now on, when the subscript k is omitted it should be
assumed to be 1.
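
The role of the $\oplus_k$ operator in equations (2.1) to (2.4) can be illustrated with a small sketch. The operator is not defined in this section, so the code assumes its usual meaning, k-truncated concatenation of two sets of strings; the function name `concat_k` and the sample sets are hypothetical. Strings are represented as tuples of terminal symbols.

```python
def concat_k(left, right, k):
    """k-truncated concatenation of two sets of terminal strings (tuples)."""
    result = set()
    for w in left:
        if len(w) >= k:
            # w already has k symbols; nothing from `right` can be seen.
            result.add(w[:k])
        else:
            # w is short: extend it with every string that may follow.
            for v in right:
                result.add((w + v)[:k])
    return result


# Hypothetical data: FIRST_2 of a suffix and the lookahead of the enclosing item.
first_gamma = {("a", "b"), ("c",), ()}
lookahead_b = {("x", "y"), ("z", "z")}
```

With k = 1 the same function degenerates into the conditional union described above: the right-hand set contributes only when the left-hand set contains the empty string.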

2.2     Previous  Work 

In this section, some of the most recently published algorithms for constructing LALR
parser generators [24] [26] [32] [39] are reviewed. Most of the relevant papers focused
only on the case k = 1; the exception is [24]. However, that algorithm cannot be
implemented in practice, because it requires the computation of complete $\mathrm{LALR}_k$
lookahead sets which, as was mentioned earlier, is inherently intractable.

2.2.1      Kristensen  and  Madsen 

Kristensen and Madsen (KM) characterized $\mathrm{LALR}_k$ in terms of the $\mathrm{LRM}_0$ machine with
the following two lemmas, which can be seen as a summary of the somewhat informal
discussion at the end of the last section.

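Lemma 2.2.1 Let $p \in M_0$, then

\[
\mathrm{LALR}_k(p, [A \to \alpha \cdot \beta]) = \bigcup \{\, \mathrm{LALR}_k(q, [A \to \cdot\alpha\beta]) \mid q \in \mathrm{PRED}(p, \alpha) \,\}
\]

Lemma 2.2.2 Let $p \in M_0$, $[A \to \cdot\alpha] \in p$, and $A \neq S'$, then

\[
\mathrm{LALR}_k(p, [A \to \cdot\alpha]) = \bigcup \{\, \mathrm{FIRST}_k(\gamma) \oplus_k \mathrm{LALR}_k(p, [B \to \beta \cdot A\gamma]) \mid [B \to \beta \cdot A\gamma] \in p \,\}
\]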


The two lemmas above follow directly from theorems 2.1.3 and 2.1.4, and
definition 2.1.5. They are combined to obtain the next theorem:

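Theorem 2.2.1 Let $p \in M_0$, $[A \to \alpha \cdot \beta] \in p$, and $A \neq S'$, then

\[
\mathrm{LALR}_k(p, [A \to \alpha \cdot \beta]) = \bigcup \{\, \mathrm{FIRST}_k(\delta) \oplus_k \mathrm{LALR}_k(q, [B \to \gamma \cdot A\delta]) \mid
  q \in \mathrm{PRED}(p, \alpha) \text{ and } [B \to \gamma \cdot A\delta] \in q \,\}
\]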

Once again, when k = 1, the above theorem can be greatly simplified by replacing $\oplus_1$
by a union operator. Kristensen and Madsen also observed that the relevant (non-empty)
elements of $\mathrm{FIRST}_1(\delta)$ in the equation of theorem 2.2.1 can be computed directly from
the $\mathrm{LRM}_0$ machine. They captured that idea in the following constructive definition and
lemma:

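Definition 2.2.1 Let $p \in M_0$, then

\[
\mathrm{TRANS}(p) = \{\, a \mid [B \to \beta \cdot a\gamma] \in p \,\} \;\cup\;
  \bigcup \{\, \mathrm{TRANS}(\mathrm{GOTO}_0(p, A)) \mid [B \to \beta \cdot A\gamma] \in p \text{ and } A \Rightarrow^* \epsilon \,\}
\]


Lemma 2.2.3 Let $p \in M_0$, then

\[
\mathrm{TRANS}(p) = \bigcup \{\, \mathrm{FIRST}_1(\beta) \mid [A \to \alpha \cdot \beta] \in \mathrm{KERNEL}(p) \,\} - \{\epsilon\}
\]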
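
Definition 2.2.1 translates almost directly into a recursive function. The following is a minimal sketch, not the authors' implementation: the automaton is a hand-made, hypothetical fragment in which `dot_symbols` lists, for each state, the symbols that appear immediately after the dot in its items; `goto`, `terminals` and `nullable` are likewise assumed inputs. A `visited` set is added so that the recursion terminates on cyclic automata.

```python
def trans(state, dot_symbols, goto, terminals, nullable, visited=None):
    """Terminals readable from `state`, looking through nullable nonterminals."""
    if visited is None:
        visited = set()
    visited.add(state)
    result = set()
    for x in dot_symbols[state]:
        if x in terminals:
            # Direct transition on a terminal.
            result.add(x)
        elif x in nullable:
            # A nullable nonterminal can be skipped: look into its target state.
            target = goto[(state, x)]
            if target not in visited:
                result |= trans(target, dot_symbols, goto,
                                terminals, nullable, visited)
    return result


# Hypothetical automaton fragment (state numbers and symbols are made up).
dot_symbols = {0: ["A", "x"], 1: ["B", "y"], 2: ["z"]}
goto = {(0, "A"): 1, (1, "B"): 2}
terminals = {"x", "y", "z"}
nullable = {"A"}          # A derives the empty string, B does not
```

In this fragment TRANS(0) picks up "x" directly and "y" through the nullable nonterminal A, while "z" is not reached because B is not nullable.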

Using the TRANS sets, theorem 2.2.1 can be reformulated for the case k = 1 as follows:

Theorem 2.2.2 Let $p \in M_0$, $[A \to \alpha \cdot \beta] \in p$, and $A \neq S'$, then

\[
\mathrm{LALR}_1(p, [A \to \alpha \cdot \beta]) = \bigcup \{\, L(q, A) \mid q \in \mathrm{PRED}(p, \alpha) \,\}
\]

where

\[
L(q, A) = \mathrm{TRANS}(\mathrm{GOTO}_0(q, A)) \;\cup\;
  \bigcup \{\, \mathrm{LALR}_1(q, [B \to \gamma \cdot A\delta]) \mid [B \to \gamma \cdot A\delta] \in q \text{ and } \delta \Rightarrow^* \epsilon \,\}
\]

procedure TRANS(q);
    TM := TM ∪ {q};
    for [B → α·Xβ] ∈ q loop
        if X ∈ T then
            LA := LA ∪ {X};
        elsif X ⇒* ε and (GOTO_0(q, X) ∉ TM) then
            TRANS(GOTO_0(q, X));
        end if;
    end loop;
end TRANS;

procedure LALR(p, [A → α·β]);
    DONE := DONE ∪ {(p, [A → α·β])};
    for q ∈ PRED(p, α) loop
        TM := ∅;
        TRANS(GOTO_0(q, A));
        for [B → γ·Aδ] ∈ q | δ ⇒* ε and (q, [B → γ·Aδ]) ∉ DONE loop
            LALR(q, [B → γ·Aδ]);
        end loop;
    end loop;
end LALR;

function LALR_1(p, [A → α·β]);
    DONE := ∅;
    LA := ∅;
    if A = S' then
        LA := {⊥};
    else
        LALR(p, [A → α·β]);
    end if;
    return LA;
end LALR_1;

Figure 2.2: KM LALR_1 algorithm

From theorem 2.2.2, a set of equations defining a function $\mathrm{LALR}_1$ from
$\mathit{Item} \times \mathit{State} \to 2^T$ is derived. These equations are then solved in order to
compute an LALR(1) lookahead set. The solution of interest is the smallest one satisfying
the equations. The algorithm is shown in Figure 2.2.

The function LALR_1 takes as arguments an LR(0) item I and a state p and returns the
lookahead set for I in p. It invokes two recursive procedures, TRANS and LALR, which
are used, respectively, to compute intermediate TRANS and LALR sets. The global set
variable LA is used to construct the resulting lookahead set. The other global set variables,
DONE and TM, are used to prevent LALR and TRANS, respectively, from being visited
more than once with the same argument for a given call to LALR_1.

This algorithm is straightforward and fairly efficient for computing the lookahead set
for a single item in a given state. However, when it is used to compute the lookahead sets
for many final items in an LR(0) automaton, as is necessary for constructing an LALR(1)
parser, much time is spent recomputing the same intermediate lookahead sets. One way
to avoid these recomputations is to save all lookahead sets that are computed. However,
this approach would not only require much space, but it also cannot easily be incorporated
in this framework because of the global nature of the algorithm. A more reasonable approach
(space-wise) would be to save the intermediate L(q, A) sets, but once again, it is not clear
how the computation of these sets can be factored out in the above algorithm.

2.2.2     DeRemer and Penello

The LALR(1) lookahead algorithm of DeRemer and Penello (DP) is superior to the KM
algorithm in that it avoids duplicating the computation of certain intermediate sets, called
FOLLOW, which are analogous to the L sets of theorem 2.2.2. Additional space is required
to save these FOLLOW sets, but the number of such sets needed is bounded by the number
of nonterminal transitions in $\mathrm{GOTO}_0$, which is usually not very large.

In 1977, Eve and Kurki-Suonio (EK) [14] presented an efficient algorithm for computing
the transitive closure of an arbitrary relation based on Tarjan's algorithm [7] for finding
strongly connected components in a directed graph. Recall that a strongly connected
component (SCC) of a directed graph is a maximal set of vertices in which there is a path
from any one vertex in the set to any other vertex in the set. An SCC consisting of a
single node with no path to itself is said to be trivial. The main contribution of DeRemer
and Penello was in showing how the EK algorithm can be adapted to efficiently compute
a recursively defined set-valued function, and, in particular, how to apply that algorithm
to the computation of LALR(1) lookahead sets. The EK algorithm will be described later
in detail; but first, the fundamental DP definitions and theorems are reviewed.

Definition 2.2.2 $(p, A)$ reads $(r, C)$ iff $\mathrm{GOTO}_0(p, A) = r$, $\mathrm{GOTO}_0(r, C)$ is defined and
$C \Rightarrow^* \epsilon$.

The reads relation relates certain states of $M_0$ in the same way as states were related
in definition 2.2.1 of TRANS sets. In fact, DP also define READ sets, which are analogous
to the TRANS sets in the following way:

\[
\mathrm{READ}(q, A) = \mathrm{TRANS}(\mathrm{GOTO}_0(q, A))
\]

The following theorem, which is similar to definition 2.2.1, is also presented:

Theorem 2.2.3

\[
\mathrm{READ}(p, A) = \mathrm{DR}(p, A) \;\cup\; \bigcup \{\, \mathrm{READ}(r, C) \mid (p, A) \text{ reads } (r, C) \,\}
\]
where
\[
\mathrm{DR}(p, A) = \{\, a \in T \mid \mathrm{GOTO}_0(p, Aa) \text{ is defined} \,\}
\]

Definition 2.2.3 $(p, A)$ includes $(p', B)$ iff $B \to \beta A\gamma$, $\gamma \Rightarrow^* \epsilon$, and $\mathrm{GOTO}_0(p', \beta) = p$.

The  includes  relation  relates  state-nonterminal  pairs  in  the  same  way  as  they  were 
related  in  theorem  2.2.2  by  the  L  equation. 

If a state p contains a transition on a nonterminal symbol A, then it also contains at
least one item of the form $[A \to \cdot\omega]$ which was introduced by closure. The FOLLOW set
of a state p and a nonterminal A is defined as follows:

\[
\mathrm{FOLLOW}(p, A) = \mathrm{LALR}_1(p, [A \to \cdot\omega])
\]

In other words, FOLLOW(p, A) is the set of all terminal symbols that may appear on
input after a transition on A in state p. Similarly, the LALR(1) lookahead set for a state
q and a final item $[A \to \omega\cdot]$ is captured by LA sets, which are defined as follows:

\[
\mathrm{LA}(q, [A \to \omega\cdot]) = \mathrm{LALR}_1(q, [A \to \omega\cdot])
\]

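The next two theorems of DP show how FOLLOW and LA sets can be computed:

Theorem 2.2.4

\[
\mathrm{FOLLOW}(p, A) = \mathrm{READ}(p, A) \;\cup\; \bigcup \{\, \mathrm{FOLLOW}(p', B) \mid (p, A) \text{ includes } (p', B) \,\}
\]

Theorem 2.2.5

\[
\mathrm{LA}(q, [A \to \omega\cdot]) = \bigcup \{\, \mathrm{FOLLOW}(p, A) \mid p \in \mathrm{PRED}(q, \omega) \,\}
\]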


By combining theorems 2.2.5 and 2.2.4 and definition 2.2.3, one observes that the
FOLLOW sets described above are the same as the L sets of theorem 2.2.2, and that the LA
sets of theorem 2.2.5 are the same as the $\mathrm{LALR}_1$ sets of theorem 2.2.2, except that
the breakdown of the components is done differently.

This framework has many advantages. From a practical point of view, each intermediate
FOLLOW set can be computed once and saved for later use. As these intermediate
sets are computed, their content can also reveal certain facts about the underlying grammar.
DeRemer and Penello proved that when the reads relation contains one or more
cycles, the underlying grammar is not LR(k) for any k. They also conjectured that, given
(p, A), a nonterminal transition that is in a nontrivial SCC of the digraph induced by the
includes relation, the corresponding grammar is not LR(k) for any k if $\mathrm{READ}(p, A) \neq \emptyset$.
This conjecture was later proved by Sager [35].

From the above theorems, one observes that the problem of computing LALR(1) lookahead
sets has been decomposed into four separate computations. In reverse order, they
are as follows: LA sets are computed from FOLLOW sets of nonterminal transitions;
FOLLOW sets are computed from READ sets of nonterminal transitions; READ sets are
computed from DR (Direct Read) sets; and DR sets are computed by inspecting certain
transitions of $\mathrm{GOTO}_0$. In addition, two relations, reads and includes (defined on
nonterminal transitions), relate the READ sets and the FOLLOW sets, respectively.

The way in which FOLLOW, READ and DR are related makes it possible for an
appropriate graph traversal algorithm to be applied to compute READ from DR and
FOLLOW from READ. The two graphs of interest are those induced by the relations
reads and includes. The general case of this problem can be stated as follows.

Let R be a relation on a set X. Let F and F' be set-valued functions such that for all
$x \in X$,

\[
F(x) = F'(x) \;\cup\; \bigcup \{\, F(y) \mid x \mathrel{R} y \,\}
\]

where F' is given for all $x \in X$.
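
As a baseline for this general problem, the smallest solution of the equation can be obtained by plain iteration until nothing changes. The sketch below only illustrates the problem statement and is not one of the algorithms reviewed in this chapter; the function name and the sample relation are hypothetical.

```python
def solve_by_iteration(nodes, f_prime, relation):
    """Least solution of F(x) = F'(x) | union of F(y) for all y with x R y."""
    f = {x: set(f_prime[x]) for x in nodes}
    changed = True
    while changed:
        changed = False
        for x in nodes:
            for y in relation.get(x, ()):
                # Propagate F(y) into F(x) and remember whether F(x) grew.
                if not f[y] <= f[x]:
                    f[x] |= f[y]
                    changed = True
    return f


# Hypothetical relation: a R b, b R c, c R b, c R d  (b and c form a cycle).
nodes = ["a", "b", "c", "d"]
relation = {"a": ["b"], "b": ["c"], "c": ["b", "d"]}
f_prime = {"a": {1}, "b": {2}, "c": {3}, "d": {4}}
```

The repeated passes over every edge are exactly the cost that a traversal based on strongly connected components is meant to avoid.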

If the underlying graph induced by the relation R contains no cycles, a straightforward
recursive algorithm can efficiently compute F. If, however, the graph contains SCCs, a
recursive algorithm such as the one advocated by Kristensen and Madsen can be relatively
inefficient since it involves multiple traversals of the SCCs. If x and y are members of
an SCC then $x \mathrel{R^*} y$ and $y \mathrel{R^*} x$. It follows from that observation that
$F(x) \subseteq F(y)$ and $F(y) \subseteq F(x)$; hence, $F(x) = F(y)$.

The algorithm proposed by DeRemer and Penello for computing F can be seen as
accomplishing two tasks. The first task is to construct a new digraph by collapsing each
set of nodes making up a non-trivial SCC into a single supernode and, for each supernode
$\alpha$, to let $F'(\alpha) = \bigcup \{\, F'(x) \mid x \in \alpha \,\}$. The new digraph so constructed contains no cycles.
The second task is to traverse the new digraph in a straightforward recursive fashion to
compute F on the supernodes, then propagate the value of $F(\alpha)$ to each node x that was
collapsed into $\alpha$. The striking efficiency of the algorithm is due to the fact that these
two objectives are achieved in a single traversal of the original digraph without having to
explicitly construct the collapsed graph.

Let S be an initially empty global stack of elements of X (the size of S will never exceed
|X|). Let N be a global mapping from each element of X into a non-negative number. Let
F and F' be global set-valued maps defined as above, where F' is precomputed or can
be easily computed on the fly. The DP digraph algorithm is stated in Figure 2.3.

TRAVERSE is a recursive procedure that takes two arguments: a node $x \in X$, and an
integer variable d which represents the depth of the recursion at which it was invoked for
node x. Initially, the global map N is set to 0 for each element x of X, indicating that
F(x) has not yet been computed. When $1 \le N(x) < \infty$, the computation
of F(x) is in progress. When $N(x) = \infty$, F(x) has already been computed.

Upon entering TRAVERSE for a given node x, x is pushed onto the global stack S, N(x)
is set to the depth of the recursion and F(x) is initialized with F'(x). Next, the algorithm
loops through the set of elements y related to x, and for each such y, if the computation

proc TRAVERSE(x, d);
    S := S + [x];
    N(x) := d;
    F(x) := F'(x);
    for y ∈ X | x R y loop
        if N(y) = 0 then
            TRAVERSE(y, d+1);
        end if;
        N(x) := N(x) min N(y);
        F(x) := F(x) ∪ F(y);
    end loop;
    if N(x) = d then
        until y = x loop
            pop y from S;
            F(y) := F(x);
            N(y) := ∞;
        end loop;
    end if;
end TRAVERSE;

N := 0;
S := [];
for x ∈ X | N(x) = 0 loop
    TRAVERSE(x, 1);
end loop;

Figure 2.3: Digraph algorithm

of F(y) had not yet been initiated, TRAVERSE is invoked recursively to compute it. If
the computation of F(y) is already in progress (as indicated by N(y) < N(x)), then the
relation R contains a cycle which includes x and y. In that case N(y) is assigned to N(x)
to indicate that the node y was traversed first and that the computation of F(x) cannot
be completed until F(y) is completed. In any case, for each y such that x R y, the elements
of F(y), which must include (at least) the set F'(y), are added to the set F(x).

Upon exiting the loop, if $N(x) \neq d$ the procedure TRAVERSE exits, leaving x on the
stack S. On the other hand, if N(x) = d, this indicates that x is the very first element of
an SCC that was traversed and that N(x) is the lowest value of N(y) over the elements y
of the SCC. When TRAVERSE returns to the first element, x, of an SCC, all the elements
of the SCC are on top of the stack S with x at the bottom of the pile and the computation
of F(x) is complete. The algorithm then proceeds with its final step by popping, in turn,
each element y of the SCC from S, setting F(y) = F(x) and setting $N(y) = \infty$.
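
The traversal just described can be sketched in a few lines of Python. This is a hedged illustration rather than a transcription of Figure 2.3: the names `digraph` and `traverse` and the sample graphs are hypothetical. One deliberate deviation, made as an assumption of the sketch, is that each node is numbered with the size of the stack at the moment it is pushed instead of the recursion depth. This keeps the numbers of all stacked nodes distinct; with recursion depth, it appears that a node visited later can receive the same number as a node left on the stack and be closed off before the set of its component is complete (the first graph below is such a case).

```python
import math


def digraph(nodes, f_prime, relation):
    """Compute F(x) = F'(x) | union of F(y) for x R y in one traversal."""
    n = {x: 0 for x in nodes}      # 0: unvisited, inf: finished
    f = {}
    stack = []

    def traverse(x):
        stack.append(x)
        d = len(stack)             # assumption: stack depth, not recursion depth
        n[x] = d
        f[x] = set(f_prime[x])
        for y in relation.get(x, ()):
            if n[y] == 0:
                traverse(y)
            n[x] = min(n[x], n[y])
            f[x] |= f[y]
        if n[x] == d:
            # x is the first visited node of its SCC: close the whole component.
            while True:
                y = stack.pop()
                n[y] = math.inf
                f[y] = set(f[x])
                if y == x:
                    break

    for x in nodes:
        if n[x] == 0:
            traverse(x)
    return f


# Hypothetical graph 1: all four nodes lie in one SCC.
g1 = {"r": ["p", "w"], "p": ["y", "r"], "y": ["p"], "w": ["y"]}
fp1 = {"r": {1}, "p": {2}, "y": {3}, "w": {4}}
# Hypothetical graph 2: a chain with an embedded cycle between b and c.
g2 = {"a": ["b"], "b": ["c"], "c": ["b", "d"]}
fp2 = {"a": {1}, "b": {2}, "c": {3}, "d": {4}}
```

Because the function recurses once per node along a path, very deep graphs may need an explicit stack or a raised recursion limit in Python.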

To compute the READ sets using the digraph algorithm, let X be the set of state and
nonterminal pairs (p, A) in the domain of $\mathrm{GOTO}_0$; let F' be the function DR; and let R
be the reads relation. To compute the FOLLOW sets, let X be the same set of pairs as
for the READ sets; let F' be the READ map; and let R be the includes relation.

As can be observed from the two algorithms discussed so far, algorithms for computing
LALR(1) lookahead sets are dominated by union operations. Using the number of union
operations performed as the criterion for measuring the time efficiency of such algorithms,
the digraph algorithm discussed in this section is faster than the KM algorithm. Given
a digraph with n nodes and m edges, its worst-case running time is O(n + m). Using
this method to compute all the LALR(1) lookahead sets for a given LR(0) automaton,
the total number of union operations required is equal to the sum of the number of edges
in the digraphs induced by the reads and includes relations for the automaton, plus
the number of union operations required to compute the final LA sets. This approach
also has the advantage of being incremental; i.e., a given FOLLOW set does not have to
be computed unless it is required to compute a final lookahead set or it is related (via
includes) to another FOLLOW set that is required. In practice, this feature is useful
since not all lookahead sets need to be computed in order to resolve conflicts.

2.2.3     Park,  Choe  and  Chang 

Park, Choe and Chang (PCC) sought to reduce the size of the graph induced by the includes
relation as proposed by DeRemer and Penello by eliminating vertices and edges associated
with closure items. The main idea behind their approach was to precompute the lookahead
contribution of the FOLLOW sets associated with such items directly from the grammar
using a left dependency relation $L \subseteq N \times N$, defined as follows:

\[
B \mathrel{L} C \quad \text{iff} \quad B \to C\beta \in P
\]

The digraph induced by the L relation is called the L-graph. Each edge (B, C) of the
L-graph is labeled with the suffix $\beta$.

Definition 2.2.4

\[
\mathrm{PATH}_k(B, C) = \bigcup \{\, \mathrm{FIRST}_k(\beta_n \ldots \beta_2 \beta_1) \mid B_0 = B,\ B_n = C,\ n \ge 0,\
  B_i \to B_{i+1}\beta_{i+1} \in P,\ 0 \le i < n \,\}
\]

where the sequence $\beta_1 \ldots \beta_n$ is the concatenation of the edge labels of the path $(B_0, \ldots, B_n)$
in the L-graph.

$\mathrm{PATH}_k(B, C)$ is a subset of the lookahead strings or initial prefixes thereof that will
appear after a nonterminal C when an item $[A \to \alpha \cdot B\beta]$ is in a given state and
$B \Rightarrow^* C\gamma$. Note that the definition of $\mathrm{PATH}_k$ is independent of the LR state
containing these items.

Consider an operator $\delta$, a mapping in the power set of LR(k) items, defined as follows:

Definition 2.2.5 Let $[A \to \alpha \cdot X\beta, u]$ be an LR(k) item; then,

\[
\delta(\{[A \to \alpha \cdot X\beta, u]\}) = \{\, [X \to \cdot\omega, v] \mid v \in \mathrm{FIRST}_k(\beta u),\ X \to \omega \in P \,\}
\]

The reflexive transitive closure of $\delta$ can be used to define the CLOSURE operation on an
LR(k) item as follows:

\[
\mathrm{CLOSURE}(\{[A \to \alpha \cdot X\beta, u]\}) = \delta^*(\{[A \to \alpha \cdot X\beta, u]\})
\]

Clearly, these definitions can be extended to a set of LR(k) items containing more than
one element. Therefore, given the kernel of a state p in a canonical collection of LR(k)
items, the state itself can be characterized using $\delta$, as follows:

\[
p = \mathrm{KERNEL}(p) \;\cup\; \delta^+(\mathrm{KERNEL}(p))
\]

where the two sets $\mathrm{KERNEL}(p)$ and $\delta^+(\mathrm{KERNEL}(p))$ are disjoint.

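Given an LR(k) kernel item $[A \to \alpha \cdot B\beta, u]$, PCC proved the following lemma, which
relates the kernel item in question to all the closure items that it can introduce:

Lemma 2.2.4

\[
\delta^+(\{[A \to \alpha \cdot B\beta, u]\}) = \{\, [C \to \cdot\omega, v] \mid B \mathrel{L^*} C,\
  v \in \mathrm{PATH}_k(B, C) \oplus_k \mathrm{FIRST}_k(\beta u),\ C \to \omega \in P \,\}
\]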

From the above lemma, one observes that the set of LR(k) lookahead strings associated
with each closure item can be expressed in terms of the PATH function and the suffix and
lookahead set of the kernel item from which the closure item in question was derived. Thus,
with this formalism, the computation of lookahead sets does not require steps involving
intermediate closure items since their contribution is effectively captured by the $\mathrm{PATH}_k$
sets. Combining definition 2.1.5 with lemma 2.2.4, one can conclude that the same equation
holds for the LALR(k) case, since $\mathrm{PATH}_k$ does not depend on states. The following
theorem summarizes this fact:

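Theorem 2.2.6 Let $p \in M_0$, $[A \to \alpha_1 \cdot \alpha_2] \in p$, and $A \neq S'$; then

\[
\mathrm{LA}_k([A \to \alpha_1 \cdot \alpha_2], p) =
  \{\, u \mid u \in \mathrm{PATH}_k(A', A) \oplus_k \mathrm{FIRST}_k(\beta_2) \oplus_k \mathrm{LA}_k([B \to \beta_1 \cdot A'\beta_2], q),\
  q \in \mathrm{PRED}(p, \alpha_1),\ A' \mathrel{L^*} A,\ [B \to \beta_1 \cdot A'\beta_2] \in \mathrm{KERNEL}(q) \,\}
\]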

Even though their formalism is presented for the general case, Park, Choe and Chang
only considered the case k = 1 when describing their algorithm. Using theorem 2.2.6,
they derived the following constructive definition of LALR(1) sets:

\[
\mathrm{LA}([A \to \alpha_1 \cdot \alpha_2], p) = \{\, a \mid a \in \mathrm{PATH}(A', A) \oplus \mathrm{FOLLOW}(q, A'),\
  q \in \mathrm{PRED}(p, \alpha_1),\ A' \mathrel{L^*} A,\ [B \to \beta_1 \cdot A'\beta_2] \in \mathrm{KERNEL}(q) \,\}
\]

where

\[
\mathrm{FOLLOW}(q, A') = \{\, a \mid a \in \mathrm{FIRST}(\beta_2) \oplus \mathrm{PATH}(B', B) \oplus \mathrm{FOLLOW}(r, B'),\
  [B \to \beta_1 \cdot A'\beta_2] \in \mathrm{KERNEL}(q),\ r \in \mathrm{PRED}(q, \beta_1),\
  [C \to \gamma_1 \cdot B'\gamma_2] \in \mathrm{KERNEL}(r),\ B' \mathrel{L^*} B \,\}
\]

As was observed before, in the case k = 1, the $\oplus$ operation can be replaced by a
conditional union operation. Once again, note that the only items involved in the above
equations are kernel items.

The FOLLOW sets described above are not the same as those of DeRemer and Penello.
Note, for example, that it may be the case that $\mathrm{READ}(q, A') \not\subseteq \mathrm{FOLLOW}(q, A')$ if state q
also contains a closure item with A' as its dot symbol. But, in such a case, the contribution
of these closure lookahead symbols is captured by PATH(A', A) in the LA equation. This
distinction is not made clear in the PCC paper. In fact, the definition the authors give for
FOLLOW (definition 5.1 in [32]) is exactly that of DeRemer and Penello, but in proving
their result, they use a different definition though they refer to it as the same definition.

Note that the domain of the FOLLOW map described above is limited to pairs (q, A')
where the state q in question contains at least one kernel item whose dot symbol is A'.

The PCC algorithm is also implemented in stages, just like the DP algorithm. First,
the L-graph is constructed; then the reflexive transitive closure $L^*$ of L is computed (using
the digraph algorithm). Next, $L^*$ is used to compute the necessary PATH sets. Then, the
necessary FOLLOW sets are computed using the digraph algorithm. Finally, the PATH
and FOLLOW sets are used to compute the LA sets.

Park, Choe and Chang claim that the efficiency of their approach is due to the "order
of magnitude reduction" in the number of FOLLOW sets that are computed using their
formalism. They also claim that, though this saving is partially offset by the calculation of
the PATH sets, in general far fewer of these sets are required than FOLLOW sets, since
the former depend only on nonterminals whereas the latter also involve states.

This author was not able to substantiate these claims. In fact, in experiments this
method was consistently outperformed in time and storage utilization by the DP
algorithm. In addition, the PCC approach has the major disadvantage that it cannot be
implemented incrementally in a way that distributes the cost of computing each lookahead
set somewhat uniformly. In other words, the most costly part of the computation is factored
out by the construction of the L-graph and the calculation of the PATH sets, which must
be done globally. It is almost never the case that lookahead sets must be computed for all
final items in order to resolve conflicts. Therefore, these global calculations are a waste
of time and storage since they serve no other useful purpose (such as helping
in the identification of certain non-LR(k) grammars). Furthermore, the algorithm also
requires the computation of FIRST for certain suffixes. PCC do not make clear how these
sets should be computed and discount their impact on the overall cost of the algorithm
by making the erroneous statement that the "computation of FIRST is also required in
constructing LR(0) states".

2.2.4     Bermudez and Logothetis

The three algorithms discussed so far have had the same general flavor: construct one
or more grammar-based maps such as FIRST and PATH, construct the LR(0) machine,
construct intermediate sets such as TRANS, READ and FOLLOW, and use them to
compute lookahead sets for final items. The algorithm of Bermudez and Logothetis (BL)
is much simpler. It is similar to the DP approach in that the same intermediate FOLLOW
sets are computed, but their strategy is radically different. Before getting into the details
of the BL algorithm, let us first review the definition of SLR parsers and how they are
constructed.

In an SLR parser, the lookahead set for a final item is computed as the traditional
FOLLOW set of terminal symbols that may follow the left side nonterminal in some
sentential form. To avoid confusion, these sets will be referred to as SLR_FOLLOW sets.

Definition 2.2.6

\[
\mathrm{SLR\_FOLLOW}(A) = \{\, a \mid S \Rightarrow^* \alpha A a \beta \,\}
\]

Usually, SLR_FOLLOW sets are computed directly from the grammar using a simple
iterative algorithm [33] [38], but they can also be computed using the digraph algorithm
described earlier [41]. For a final item $[A \to \omega\cdot]$ in a state q, the SLR lookahead set is
defined as:

\[
\mathrm{SLR\_LA}(q, [A \to \omega\cdot]) = \mathrm{SLR\_FOLLOW}(A)
\]

Note from the equation above that the computation of an SLR lookahead set does not
depend on the state. The symbols in the LALR(1) lookahead set for $[A \to \omega\cdot]$ in state q
also "follow" nonterminal A, but they do so in the context of state q, as can be observed
in theorem 2.2.5.

The technique of Bermudez and Logothetis consists of constructing a new grammar, G',
which captures the "contextual dependency" of LALR(1) FOLLOW sets in such a way that
the FOLLOW set for each state-nonterminal pair in the domain of $\mathrm{GOTO}_0$ corresponds
to the SLR_FOLLOW set of a nonterminal in G'.

Recall the following property of $\mathrm{LRM}_0$: for every state-nonterminal pair $(p_1, A)$ in the
domain of $\mathrm{GOTO}_0$ and for every production $A \to \omega \in P$, there exists a state r such that
$\mathrm{GOTO}_0(p_1, \omega) = r$ and $[A \to \omega\cdot] \in r$.

The new grammar G' is constructed in such a way that its vocabulary, V', consists of all
pairs in the domain of $\mathrm{GOTO}_0$. For each pair $[p_1, A]$ corresponding to a nonterminal
transition and each rule $A \to \omega$ in the original grammar G, G' contains one production
for the path corresponding to $\mathrm{GOTO}_0(p_1, \omega)$. The left side of the production is the pair
$[p_1, A]$. The right-hand side of the production consists of the sequence of pairs corresponding
to the transitions on the symbols of $\omega$; i.e., for each $\omega = X_1 X_2 \ldots X_n$, assume (WLOG) that
$\mathrm{GOTO}_0(p_i, X_i) = p_{i+1}$, $1 \le i \le n$; then G' contains the following production:

\[
[p_1, A] \to [p_1, X_1][p_2, X_2] \ldots [p_n, X_n]
\]

In particular, if $\omega = \epsilon$, then $[p_1, A] \to \epsilon$ is also in G'. Thus, grammar G' is similar to G,
except for the fact that a certain amount of symbol splitting has taken place during the
construction of $\mathrm{LRM}_0$. The definition of G' can be formalized as follows:

Definition 2.2.7 For a context-free grammar G = (N, T, P, S), let G' = (N', T', P', S'),
where

\begin{align*}
N' &= \{\, [p, A] \mid \mathrm{GOTO}_0(p, A) \text{ is defined} \,\} \\
T' &= \{\, [p, a] \mid \mathrm{GOTO}_0(p, a) \text{ is defined} \,\} \\
S' &= [IS_0, S] \\
P' &= \{\, [p_1, A] \to [p_1, X_1][p_2, X_2] \ldots [p_n, X_n] \mid [p_1, A] \in N', \\
   &\qquad [p_i, X_i] \in N' \cup T' \text{ for } 1 \le i \le n, \\
   &\qquad \mathrm{GOTO}_0(p_i, X_i) = p_{i+1} \text{ for } 1 \le i < n \\
   &\qquad \text{and } A \to X_1 X_2 \ldots X_n \in P \,\}
\end{align*}
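
The construction of P' can be sketched as a walk through the transition table. This is a minimal illustration under stated assumptions, not the BL implementation: the tiny grammar (S → a A, A → b, A → ε), the state numbers in `goto` and the function name `build_g_prime` are all hypothetical, and the table is assumed to come from a correctly built LR(0) automaton.

```python
def build_g_prime(productions, goto, nonterminals):
    """Productions of G': one per nonterminal transition and matching rule."""
    g_prime = []
    for (p, a) in goto:
        if a not in nonterminals:
            continue                      # only nonterminal transitions get rules
        for lhs, rhs in productions:
            if lhs != a:
                continue
            state, new_rhs = p, []
            for x in rhs:
                # Pair each right-hand-side symbol with the state it is read in.
                new_rhs.append((state, x))
                state = goto[(state, x)]
            g_prime.append(((p, a), tuple(new_rhs)))
    return g_prime


# Hypothetical grammar: S -> a A, A -> b, A -> (empty).
productions = [("S", ("a", "A")), ("A", ("b",)), ("A", ())]
nonterminals = {"S", "A"}
# Hypothetical LR(0) transition table for that grammar.
goto = {(0, "S"): 1, (0, "a"): 2, (2, "A"): 3, (2, "b"): 4}
```

For this input the sketch yields [0, S] → [0, a][2, A], [2, A] → [2, b] and [2, A] → ε, which mirrors the symbol splitting described above.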

In [33], the authors suggest that SLR_FOLLOW for all nonterminals A in a given
grammar G = (N, T, P, S) be computed by "applying the following rules until nothing can
be added to any [SLR_]FOLLOW set":

1. Place $\perp$ in SLR_FOLLOW(S).

2. If there is a production $A \to \alpha B\beta$, then everything in $\mathrm{FIRST}(\beta)$ except for $\epsilon$ is
placed in SLR_FOLLOW(B).

3. If there is a production $A \to \alpha B$, or a production $A \to \alpha B\beta$ where $\mathrm{FIRST}(\beta)$
contains $\epsilon$, then everything in SLR_FOLLOW(A) is in SLR_FOLLOW(B).
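
The three rules can be applied mechanically in a loop that stops when no set grows. The sketch below is one straightforward reading of them, not code from [33]: the FIRST map (given without ε) and the set of nullable nonterminals are assumed to be precomputed, "$" stands in for the end marker, and the grammar is a hypothetical example.

```python
END = "$"  # stands in for the end-of-input marker


def first_of(symbols, first, nullable):
    """Terminals that can start `symbols`, and whether `symbols` derives empty."""
    out = set()
    for x in symbols:
        out |= first[x]
        if x not in nullable:
            return out, False
    return out, True


def slr_follow(productions, start, first, nullable):
    nonterminals = {lhs for lhs, _ in productions}
    follow = {a: set() for a in nonterminals}
    follow[start].add(END)                                  # rule 1
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            for i, b in enumerate(rhs):
                if b not in nonterminals:
                    continue
                new, rest_nullable = first_of(rhs[i + 1:], first, nullable)  # rule 2
                if rest_nullable:
                    new = new | follow[lhs]                 # rule 3
                if not new <= follow[b]:
                    follow[b] |= new
                    changed = True
    return follow


# Hypothetical grammar: S -> A B, A -> a, A -> (empty), B -> b.
productions = [("S", ("A", "B")), ("A", ("a",)), ("A", ()), ("B", ("b",))]
first = {"a": {"a"}, "b": {"b"}, "S": {"a", "b"}, "A": {"a"}, "B": {"b"}}
nullable = {"A"}
```

Each pass revisits every production, which is why this style of iteration can be slow on a grammar as large as G'.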

Once G' has been constructed and SLR_FOLLOW has been computed for the
nonterminals of G', the LALR(1) lookahead set for a given state-item pair in $\mathrm{LRM}_0$ can be
obtained as follows:

\[
\mathrm{LA}(p, [A \to \omega\cdot]) = \{\, a \mid [r, a] \in \mathrm{SLR\_FOLLOW}([q, A]),\ q \in \mathrm{PRED}(p, \omega) \,\}
\]

The BL algorithm is simple and straightforward. However, if implemented with the above
iterative algorithm for SLR_FOLLOW, as suggested in [39], it will perform very poorly
compared to the other algorithms. Fortunately, the digraph algorithm can also be used to
compute the SLR_FOLLOW sets efficiently. This will be discussed in the next section.

2.3      Improvements 

In this section, some modifications are suggested to improve the performance of the KM,
DP and BL algorithms.

2.3.1      Improving  the  KM  algorithm 

As pointed out earlier, the global nature of the KM algorithm makes it difficult to
make any fundamental change to that algorithm. However, some local optimizations are
still possible that can greatly reduce the total number of union operations it requires. By
observing that $\mathrm{LALR}_1(q, [A \to \cdot\omega]) = L(q, A)$ (essentially the FOLLOW sets), Kristensen
and Madsen reformulated the LALR procedure of Figure 2.2, using the set DONE to keep
track of (state, nonterminal) pairs instead of (state, item) pairs. The time performance of
the KM algorithm can be further improved by precomputing, for each state q, the set of

procedure TRANS(q);
    TM := TM ∪ {q};
    LA := LA ∪ {X | X ∈ T, [B → α·Xβ] ∈ q};
    for [B → α·Xβ] ∈ q loop
        if X ⇒* ε and (GOTO_0(q, X) ∉ TM) then
            TRANS(GOTO_0(q, X));
        end if;
    end loop;
end TRANS;

procedure LALR(p, [A → α·β]);
    for q ∈ PRED(p, α) | [q, A] ∉ DONE loop
        DONE := DONE ∪ {[q, A]};
        TM := ∅;
        TRANS(GOTO_0(q, A));
        for [B → γ·Aδ] ∈ q | δ ⇒* ε loop
            LALR(q, [B → γ·Aδ]);
        end loop;
    end loop;
end LALR;


Figure 2.4: Improved TRANS and LALR procedures


terminals  on  which  a  transition  can  be  made  on  q  and  adding  these  terminals  to  related 
LA  sets  with  a  single  union  operation.  Figure  2.4  depicts  new  formulations  for  the  LALR 
and  TRANS  procedures  that  incorporate  these  two  changes.  If  the  LA  and  TRANS  sets 
are  implemented  as  bit-strings,  this  new  approach  can  significantly  improve  the  running 
time  of  the  KM  algorithm  since,  in  many  cases,  adding  a  single  element  to  a  bit  set  (as  in 
Figure  2.2)  can  be  as  expensive  as  unioning  two  bit  sets. 
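
The bit-string remark can be illustrated with Python integers used as bit sets. This is only a sketch: the terminal alphabet, the state numbers and the `shift_mask` table are hypothetical, and the thesis does not prescribe this encoding.

```python
TERMINALS = ["a", "b", "c", "d"]                 # hypothetical terminal alphabet
INDEX = {t: i for i, t in enumerate(TERMINALS)}  # bit position of each terminal

def to_bits(symbols):
    """Pack a set of terminals into an integer bit-string."""
    bits = 0
    for s in symbols:
        bits |= 1 << INDEX[s]
    return bits

def from_bits(bits):
    """Unpack an integer bit-string into a set of terminals."""
    return {t for t, i in INDEX.items() if bits >> i & 1}

# Precomputed once per state: terminals on which the state has a transition.
shift_mask = {7: to_bits({"a", "c"}), 9: to_bits({"b"})}

la = to_bits({"d"})   # a lookahead set under construction
la |= shift_mask[7]   # one union instead of one insertion per terminal
RESULT = from_bits(la)
```

The union costs the same whether state 7 contributes one terminal or many.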

2.3.2     Improving  the  DP  algorithm 

Earlier, it was shown that the READ sets proposed by DP are related to the TRANS sets
proposed by KM as follows:

$$\mathrm{READ}(q, A) = \mathrm{TRANS}(\mathrm{GOTO}_0(q, A)).$$

From this equation, one observes that a TRANS set is defined on each state whose incoming
edges are labeled by a nonterminal, whereas a READ set is defined on each edge
pointing to such a state. Therefore, READ sets can be computed more efficiently by first
computing the TRANS sets for the relevant states and propagating these sets along the
incoming edges for the READ sets.

The digraph algorithm can be used instead of the recursive KM algorithm to compute
the TRANS sets. To do this, the reads relation and the DR function must be defined on
states rather than state-nonterminal pairs. The new definitions follow:

Definition 2.3.1 $q$ reads $r$ iff $\mathrm{GOTO}_0(q, A) = r$ and $A \Rightarrow^{*} \varepsilon$.

Definition 2.3.2 $\mathrm{DR}(p) = \{a \in T \mid [B \to \beta \cdot a \gamma] \in p\}$

Lemma 2.3.1 $\mathrm{TRANS}(p) = \mathrm{DR}(p) \cup \bigcup\{\mathrm{TRANS}(r) \mid p \text{ reads } r\}$

Using Lemma 2.3.1, the instantiation of the digraph algorithm for TRANS is straightforward.
Let X be the set of states $p \in M_0$ such that the incoming symbol on which a
transition is made into p is a nonterminal. Let R and F' be the reads relation and DR
function of Definitions 2.3.1 and 2.3.2, respectively.

However, a better overall approach is to compute TRANS sets from FIRST sets as
indicated in Lemma 2.2.3. The relevant FIRST sets for this purpose are the sets FIRST($\beta$)
for each suffix $\beta$ that appears after a nonterminal in a production of G. This approach
avoids the construction of the digraph induced by the reads relation. Moreover, given
the relevant FIRST sets, the FOLLOW sets can also be computed without the explicit
construction of the digraph induced by the includes relation (as proposed in [26]), since
that relation can be computed on the fly in such a case. Recall from Definition 2.2.3
that the inclusion of two pairs (p, A) and (p', B) in the includes relation is based on a
nullability test of a suffix following a nonterminal. This is equivalent to testing for the
presence of $\varepsilon$ in the FIRST set of the relevant suffix. This approach for computing READ
and FOLLOW sets was used successfully in [41]. It usually requires fewer union operations
than the standard DP approach and the space overhead is much lower.

2.3.3     Computation  of  FIRST  sets 

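For a grammar G = (T, N, P, S), Knuth [5] defined the following left-dependency relation
on grammar symbols:

$$X \mathrel{l} Y \quad\text{iff}\quad X \to X_1 X_2 \cdots X_n Y \alpha \ \text{ and } \ X_i \Rightarrow^{+} \varepsilon,\ 1 \le i \le n$$

and showed that for each nonterminal A,

$$\mathrm{FIRST}(A) = \{a \in T \mid A \mathrel{l^{+}} a\}.$$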

It is this problem that motivated the transitive closure algorithm of Eve and Kurki-Suonio
[14], from which the DP digraph algorithm is derived. These authors proposed
that $\mathrm{FIRST}_1$ be computed for nonterminals in two steps. In step one, the transitive closure
of the left-dependency relation is computed and for each SCC all its nodes are mapped
onto a single representative node to obtain a directed graph that is free of cycles. "The resulting
graph is then explored by Knuth's efficient "topological sort" algorithm" to obtain
the final result.

In fact, the digraph algorithm, as described earlier, can be used to compute $\mathrm{FIRST}_1$ for
nonterminals in a single pass without having to compute a transitive closure first. The set
on the right side of the equation above can be viewed as the union of two sets: an initial
set consisting of terminal symbols a that left-depend directly on A and a set of terminal
symbols that are contributed by other nonterminals that left-depend directly on A. Hence,
the equation can be rewritten as:

$$\mathrm{FIRST}(A) = \{a \in T \mid A \mathrel{l} a\} \cup \bigcup\{\mathrm{FIRST}(B) \mid A \mathrel{l} B\}.$$

This new formulation of $\mathrm{FIRST}_1(A)$ is exactly in a form suitable for the digraph algorithm.
Once $\mathrm{FIRST}_1$ is computed for each nonterminal in a grammar, it can be extended to
arbitrary strings [33].
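
As an illustration, the rewritten equation can be fed to a small version of the digraph algorithm. The sketch below follows the usual stack-based formulation (one depth-first traversal in which every strongly connected component ends up sharing one set). The function names, the dictionary grammar encoding and the toy grammar are assumptions made for this example, and the result contains terminals only, as in the equation above.

```python
def digraph(nodes, rel, f_init):
    """Solve F(x) = f_init[x] | union of F(y) for every y with x rel y."""
    INF = float("inf")
    n = {x: 0 for x in nodes}   # 0 = unvisited, INF = finished
    f = {}
    stack = []

    def traverse(x):
        stack.append(x)
        depth = len(stack)
        n[x] = depth
        f[x] = set(f_init[x])
        for y in rel.get(x, ()):
            if n[y] == 0:
                traverse(y)
            n[x] = min(n[x], n[y])
            f[x] |= f[y]
        if n[x] == depth:            # x is the root of a strongly connected component
            while True:
                top = stack.pop()
                n[top] = INF
                f[top] = set(f[x])   # every member of the component gets the same set
                if top == x:
                    break

    for x in nodes:
        if n[x] == 0:
            traverse(x)
    return f

def nullable_nonterminals(grammar):
    """Nonterminals that derive the empty string (simple fixed-point iteration)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs not in nullable and any(all(s in nullable for s in rhs) for rhs in alternatives):
                nullable.add(lhs)
                changed = True
    return nullable

def first_sets(grammar):
    """FIRST (terminals only) for every nonterminal via the left-dependency relation."""
    nullable = nullable_nonterminals(grammar)
    direct = {a: set() for a in grammar}   # terminals that left-depend directly on A
    rel = {a: [] for a in grammar}         # nonterminals that left-depend directly on A
    for lhs, alternatives in grammar.items():
        for rhs in alternatives:
            for sym in rhs:
                if sym in grammar:
                    rel[lhs].append(sym)
                else:
                    direct[lhs].add(sym)
                if sym not in nullable:    # stop at the first non-nullable symbol
                    break
    return digraph(grammar, rel, direct)

# Hypothetical grammar with a cycle (A and B) and a nullable prefix (D):
# A -> B x | a ; B -> A y | b ; C -> D A ; D -> d | epsilon
GRAMMAR = {
    "A": [["B", "x"], ["a"]],
    "B": [["A", "y"], ["b"]],
    "C": [["D", "A"]],
    "D": [["d"], []],
}
FIRST = first_sets(GRAMMAR)
```

A and B form one component, so they receive the same set without any separate transitive-closure pass.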

2.3.4      Improving  the  BL  algorithm 

The computation of the SLR_FOLLOW sets is central to the BL algorithm. Thus, any
improvement in the computation of these sets will result in an overall improvement of the
BL algorithm. From rule 3 of the iterative SLR_FOLLOW algorithm proposed earlier, one
observes that certain nonterminals are related to others and that relation determines how
the corresponding SLR_FOLLOW sets of these nonterminals are computed. The relation
in question, called s_includes here, can be defined as follows:

Definition 2.3.3 $B$ s_includes $A$ iff $A \to \alpha B \beta \in P$ and $\beta \Rightarrow^{*} \varepsilon$.

For each nonterminal B, define a function DF (read Directly Follows) on B as follows:

$$\mathrm{DF}(B) = \bigcup\{\mathrm{FIRST}(\beta) \mid A \to \alpha B \beta \in P\} - \{\varepsilon\} \tag{2.5}$$

DF(B) is simply the subset of SLR_FOLLOW(B) obtained from applying rule 2 of the iterative
algorithm. Given the s_includes relation and the DF function, the SLR_FOLLOW
sets can be written in equation form as follows:

$$\mathrm{SLR\_FOLLOW}(B) = \{\bot\}, \quad \text{if } B = S. \text{ Otherwise,} \tag{2.6}$$

$$\mathrm{SLR\_FOLLOW}(B) = \mathrm{DF}(B) \cup \bigcup\{\mathrm{SLR\_FOLLOW}(A) \mid B \text{ s\_includes } A\}. \tag{2.7}$$

With the above formulation, the digraph algorithm can be instantiated in the usual
fashion to compute the SLR_FOLLOW sets. When this approach is used, the computation
of lookahead sets with the BL method requires the same number of union operations as
the improved DP algorithm described in section 2.3.2.

Let G be a context-free grammar and consider the BL grammar G' = (N', T', P', S')
obtained from G. It is not hard to see that a nonterminal $[p : B] \in N'$ s_includes $[q : A]$
if and only if the corresponding state-nonterminal pair (p, B) in the domain of $\mathrm{GOTO}_0$
includes (q, A). Since there is a one-to-one and onto correspondence between each nonterminal
$[p : B] \in N'$ and a pair (p, B) in the domain of $\mathrm{GOTO}_0$, the graph induced by
the includes relation is isomorphic to the graph induced by the s_includes relation.
Therefore, given the READ sets for all nonterminal transitions in $\mathrm{GOTO}_0$ and the DF
sets for all nonterminals in N', the same number of union operations required to compute
SLR_FOLLOW on nonterminals of G' is required to compute FOLLOW sets for all (state,
nonterminal) pairs in the domain of $\mathrm{GOTO}_0$.

For each state q in $M_0$ whose incoming edges are labeled by a nonterminal B, the kernel
set of q contains a set of items of the form $[A \to \alpha B \cdot \beta]$. As suggested in section 2.3.2,
the TRANS set of q (from which relevant READ sets are obtained) is computed as the
union of FIRST($\beta$) for all suffixes $\beta$ following B in the kernel items of q. Similarly, assuming
that $\mathrm{GOTO}_0(p, B) = q$, DF([p : B]) (equation 2.5) is obtained by
forming the union of FIRST($\gamma$) for all suffixes $\gamma$ following [p : B] in P'. Once again, since
there is a one-to-one and onto correspondence between each rule in G' with [p : B] in
its right-hand side and a pair (p, B) such that $\mathrm{GOTO}_0(p, B) = q$, the number of union
operations required to compute the TRANS sets is the same as is required to compute the
DF sets.

Thus, when the digraph algorithm is used to compute SLR_FOLLOW sets, the BL
algorithm requires exactly the same number of union operations as the improved DP
algorithm. However, since extra time as well as space overhead is incurred in constructing
G', the DP algorithm gives a better overall performance. Depending on the nature of the
grammar, the extra space needed for G' may be substantial.

2.4     Remarks 

In this section, four algorithms for computing LALR(1) lookahead sets have been reviewed.
The first, by Kristensen and Madsen (KM) [24], is a straightforward recursive algorithm
which is efficient for computing an individual lookahead set. However, in constructing an
LALR(1) parser, a large number of lookahead sets must be computed and the computation
of some of these sets usually depends on others. In such a situation, the KM algorithm
performs poorly because it does not avoid recomputing any lookahead set. The second
algorithm, by DeRemer and Pennello (DP) [26], is the most efficient of all the algorithms
mentioned. It is superior to the KM algorithm in that it avoids recomputing certain intermediate
lookahead sets called FOLLOW, though it requires (a reasonable amount of)
extra space in order to do so. The DP framework also allows one to detect, in certain
cases, that the underlying grammar is not LR(k) for any k. (In general, this
problem is undecidable.) The third algorithm, by Park, Choe and Chang (PCC) [32], is
less time- and space-efficient than the DP algorithm, notwithstanding some experimental
comparisons presented by these authors which would indicate otherwise. The last algorithm,
by Bermudez and Logothetis (BL) [39], is the simplest of all the algorithms. A new
grammar G' is derived from the $LRM_0$ machine for the original grammar G, in such a
way that the same intermediate FOLLOW sets advocated by DP can be computed more
easily from G'. When implemented as suggested in the previous section, the BL and DP
algorithms have essentially the same running time performance except for the overhead
incurred in constructing the G' grammar. Experimental results obtained from using each
of these algorithms are presented in Appendix A.




Chapter  3 

LALR(k) Lookahead Sets with
Varying-Length Strings


Perhaps the most important reason for using LALR(k) grammars is that they allow the
user to express certain syntactic constructs in a more intuitive manner. To motivate
the importance of this observation, consider the grammar of Figure 3.1(a). This very
simple definition, which clearly captures the syntactic rules of BNF, is LALR(2). To see
this, observe that when parsing the right-hand side of a rule, one cannot determine when
looking at a symbol whether it belongs to the current rule or is the left-hand
side symbol of the next rule.

Figure 3.1: LALR grammars for BNF


The grammar of Figure 3.1(b) is an LALR(1) grammar for the same language, but note
that the rules of that grammar do not accurately reflect the syntactic structure of a BNF
rule. As a result, when using a parser generated from that grammar, there is no convenient
way of determining when the end of a rule (from the input) has been reached.

The use of multiple lookahead symbols is also helpful in detecting certain common errors
that can be flagged as warnings. For example, most programming languages contain "dead"
keywords, such as THEN, and separators, such as ";", whose only purpose is to separate
syntactic constructs. It is usually the case that if the parser can look ahead at more
than one symbol, then it can determine without the special markers where a particular
construct ends and another begins. In such a case, the markers can be made optional
(in the grammar definition) and a semantic "warning" message issued when the "empty"
choice is reduced. For example, each occurrence of "THEN" in a typical programming
language grammar can be replaced by a nonterminal "then" which is defined as follows:

then  →  ε
      |  THEN

Similarly,  observe  that  the  grammar  of  Figure  3.1(a)  can  be  modified  into  the  LALR(l) 
grammar  of  Figure  3.1(c)  by  introducing  ";"  as  a  marker  symbol  to  separate  adjacent  rules. 
When  processing  the  right-hand  side  of  a  rule  with  that  grammar,  the  parser  always  shifts 
on  "s"  and  reduces  by  "rule"  upon  encountering  a  ";"  or  the  end-of-file  token. 

In the area of parser generation, the main innovation of this thesis is a new algorithm
for the construction of efficient LALR(k) parsers for values of k larger than 1. As
noted earlier, for practical reasons, such a parser cannot be efficiently constructed in the
usual fashion (i.e., first construct an LR(0) automaton and then resolve each conflict by
computing the relevant LALR(k) lookahead sets). Instead, the minimum amount of lookahead
information is computed in an incremental fashion, when needed, as follows. When
conflicts are detected in an LR(0) state, the parser generator first computes the relevant
LALR(1) lookahead set for each final item in that state. If these lookahead sets are sufficient
to resolve the conflicts, the generator goes no further. Otherwise, each conflict
symbol is extended into a set of lookahead strings of length 2. If that does not resolve all
the remaining conflicts, each conflict string of length 2 is extended into a set of lookahead
strings of length 3, and so on, until either the upper limit k is reached or all conflicts are
successfully resolved. At the end of this process, each final item in an inconsistent LR(0)
state is associated with a lookahead set of strings whose lengths vary from 1 to k.

The use of lookahead sets with variable-length strings raises an important question,
namely, what is a good representation for the parsing tables in such a framework? Clearly,
an ACTION matrix whose columns consist of all terminal strings of length 1 up to k would
be even more space-consuming than the standard representation. A suitable representation
will be described later, but first, the traditional parsing table representation and the
concept of a default action are reviewed.

Consider the grammars (a) and (b) of Figure 3.1. Figure 3.2 shows possible parsing
tables for (a) and Figure 3.3 shows possible parsing tables for (b). Observe that for an
LALR(1) parser (Figure 3.3), each column of its ACTION matrix is indexable by a single
terminal symbol. Thus, the terminal transitions of the GOTO matrix can be combined
with the ACTION matrix to obtain a merged table as in Figure 3.4. Furthermore, note
that some states contain many reduce actions by the same rule. These actions can be
factored into a single default reduce action by adding a default column (?) to the ACTION
matrix as shown in Figure 3.5. In general, for a given state of an LALR(k) parser, the rule
with which the most reduce actions are associated is chosen as the default action [33].

During parsing, if the ACTION matrix entry for a given (state, symbol) pair is error,
the default value associated with the pair (state, ?) is used. When the default action for
a given state is not error, it may allow the parser to incorrectly perform a reduce action.
However, in such a case, the error will be detected later since the parser will not be able
to shift on the input symbol in question.
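
A small sketch may make the default-action mechanism concrete. It is an illustration only: the tuple encoding of actions, the function names and the five-entry table are hypothetical and are not taken from Figures 3.2 to 3.5.

```python
from collections import Counter

ERROR = ("error",)

def add_defaults(action, states):
    """Factor the most frequent reduce action of each state into a default entry."""
    compact, default = {}, {}
    for s in states:
        row = {sym: act for (st, sym), act in action.items() if st == s}
        reduces = Counter(act for act in row.values() if act[0] == "reduce")
        if reduces:
            default[s] = reduces.most_common(1)[0][0]
        for sym, act in row.items():
            if act != default.get(s):      # entries equal to the default are dropped
                compact[(s, sym)] = act
    return compact, default

def lookup(compact, default, state, symbol):
    """Explicit entry if present, otherwise the default of the state (or error)."""
    entry = compact.get((state, symbol), ERROR)
    return default.get(state, ERROR) if entry == ERROR else entry

# Hypothetical ACTION entries for two states.
ACTION = {
    (3, "a"): ("reduce", 2),
    (3, "b"): ("reduce", 2),
    (3, "c"): ("shift", 5),
    (3, "d"): ("reduce", 4),
    (5, "a"): ("shift", 3),
}
COMPACT, DEFAULT = add_defaults(ACTION, [3, 5])
```

In this table, `lookup(COMPACT, DEFAULT, 3, "e")` yields the default reduction even though `"e"` is not a valid lookahead in state 3. That is the premature reduce described above, and the error surfaces later when no shift on `"e"` is possible.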
Figure 3.2: LALR(2) parsing tables for BNF grammar (a)
Figure 3.3: LALR(1) parsing tables for BNF grammar (b)
Figure 3.5: Merged LALR(1) parsing tables with default actions for grammar (b)


Once again, consider the grammar of Figure 3.1(a) and its parsing tables (Figure 3.2).
With such an ACTION matrix, the parser always needs access to the next two symbols
in the input in order to make a parsing decision. However, notice that the only time the
two lookahead symbols are necessary is when the parser is in state 7 and the next input
symbol is s. In that case, if the successor of s is another s or $\bot$, the parser shifts; if the
successor is a $\to$, the parser reduces. (If default actions are used in this table, the parser
would always reduce if the successor is neither s nor $\bot$.)

Just as the lookahead set for an LALR(k) parser can be constructed in an incremental
fashion, the parser itself can "look ahead" in an incremental fashion given a proper representation
for the ACTION matrix. That is, at run time, the parser can behave just like an
LALR(1) parser when only a single lookahead is sufficient and use extra lookahead when
required. This is achieved by dividing the traditional ACTION matrix into two separate
tables: an $\mathrm{ACTION}_1$ matrix indexable by (state, terminal) pairs, just like an LALR(1)
ACTION matrix, and a lookahead table. Each entry of the $\mathrm{ACTION}_1$ matrix contains
either a relevant action to be performed (shift q, reduce ?, accept, error) or it indicates
that more lookahead is required (lookahead q') and identifies where in the lookahead table
to begin the search.

Additional lookahead is required for an LALR(k) parser when for a given state q and
terminal symbol a, the ACTION matrix of the parser contains two or more different useful
entries on lookahead strings that start with a. Let the set of terminals represent an
alphabet. A deterministic finite automaton (DFA) can be constructed in a straightforward
manner to recognize a given set of strings on that alphabet. For example, consider a set
of strings L = {abc, adb, adc, acb} on the English alphabet. Figure 3.6 shows a DFA to
recognize this set of strings.

Figure 3.6: DFA for {abc, adb, adc, acb}

Assume that the set {a, b, c, d} is a subset of the set of terminals of a grammar G and
that the set L above is a set of conflicting lookahead strings in a state q of an LALR(k)
parser for G. A DFA similar to the one in Figure 3.6 (called a lookahead DFA) can be
constructed to recognize the elements of L. The initial state $q_1$ of the lookahead DFA is
the inconsistent state q of the parser. The other states of the DFA are new states called
lookahead states (different from the lookahead states of [6]). Each path in the DFA from the
initial state $q_1$ to some final state $q_f$ spells a relevant lookahead string x in the conflicting
set and $q_f$ is associated with the LALR(k) action of the pair (q, x). A transition into a
lookahead state is called a lookahead shift.

A lookahead DFA can be stored in a table by associating each of its non-final states with
a vector (row) indexable by terminal symbols. Each entry in such a vector contains either
a transition into another lookahead state q' or an actual parsing action. A transition into
a lookahead state q' is denoted: lookahead shift q'. The lookahead table of an LALR(k)
parser is a matrix formed with the rows associated with lookahead states.

At run time, when the parser enters state q (in the example above), if the next input
symbol is a, the $\mathrm{ACTION}_1$ matrix yields a lookahead shift action to the successor of q
on a. A lookahead shift is different from a normal shift (into an LR(0) state) in that it
instructs the parser to look at the next symbol in the input without consuming the current
symbol. Once the DFA is entered, the parser tries to match the next symbols in the input
with a valid path in the DFA. If successful, a final state $q_f$ is reached and the action
associated with $q_f$ is performed; otherwise, the error action is computed at the first illegal
combination of (state, symbol) pair encountered.

Assume that each string element of the set L is associated with a parsing action
in state q as follows: ACTION(q, abc) = S2; ACTION(q, adb) = R5; ACTION(q, adc) = R6;
ACTION(q, acb) = R5. The lookahead DFA for $\mathrm{ACTION}_1$(q, a) is shown in Figure 3.7. Observe
that once states $q_3$ and $q_5$ are entered, the DFA can only follow a single path to a final
state. When that is the case, if the action associated with the final state is a reduce action,
all states on that path can be removed (except for the final action). This is analogous to
the default reduce optimization mentioned earlier. Figure 3.8 shows a DFA for the actions
above with default reductions. When default actions are used in a lookahead DFA, some
of the paths from the starting state to a final action (and the lookahead strings that spell
these paths) may be shorter than k; hence, the term: lookahead sets with variable-length
strings.
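
The construction can be sketched as a trie whose leaves hold parsing actions. This is a minimal illustration under stated assumptions: nested dictionaries stand in for lookahead states, actions are encoded as tuples, one character stands for one terminal, no lookahead string is a proper prefix of another, and the function names are invented for the example.

```python
def build_dfa(actions):
    """Trie over lookahead strings; inner dicts are lookahead states, leaves are actions."""
    root = {}
    for string, action in actions.items():
        node = root
        for sym in string[:-1]:
            node = node.setdefault(sym, {})
        node[string[-1]] = action
    return root

def single_reduce(node):
    """Return the reduce action if exactly one path leaves node and it ends in a reduce."""
    while isinstance(node, dict):
        if len(node) != 1:
            return None
        (node,) = node.values()
    return node if node[0] == "reduce" else None

def prune(node):
    """Replace every single-path chain ending in a reduce by that reduce action."""
    for sym, target in list(node.items()):
        if isinstance(target, dict):
            action = single_reduce(target)
            if action is not None:
                node[sym] = action
            else:
                prune(target)
    return node

def decide(dfa, lookahead):
    """Follow the upcoming symbols; return (action, number of symbols examined)."""
    node = dfa
    for depth, sym in enumerate(lookahead, start=1):
        target = node.get(sym)
        if target is None:
            return ("error",), depth
        if not isinstance(target, dict):
            return target, depth
        node = target
    return ("error",), len(lookahead)   # input ran out inside the DFA

# The four conflicting strings and their actions from the running example.
ACTIONS = {
    "abc": ("shift", 2),
    "adb": ("reduce", 5),
    "adc": ("reduce", 6),
    "acb": ("reduce", 5),
}
FULL = build_dfa(ACTIONS)
PRUNED = prune(build_dfa(ACTIONS))
```

After pruning, the path through `ac` ends after two symbols, while the path through `ab` keeps its third symbol because its final action is a shift.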

Figure 3.7: Lookahead DFA

Figure 3.8: Lookahead DFA with default

Recalling the earlier informal discussion of how conflicts are resolved in this method,
for each inconsistent state q, it is precisely the DFAs with default actions that are constructed
incrementally for each symbol a that is in conflict in q, and "grafted" onto q.
Thus, the space required to construct an LALR(k) parser with variable lookahead strings
is proportional to the size of the resulting parser. Consequently, this method is both fast
and practical. Figure 3.9 shows the parsing tables with variable lookahead strings for the
LALR(2) grammar of Figure 3.1(a).

3.1      LALR(k) Lookahead Sets

In this section, the fundamental definitions of READ and FOLLOW sets are extended to
accommodate the general case of $k \ge 0$.

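Definition 3.1.1

$$\mathrm{FOLLOW}_0(p, A) = \{\varepsilon\}$$
$$\mathrm{FOLLOW}_k(p, A) = \mathrm{LALR}_k(p, [A \to \cdot\,\omega])$$

Lemma 3.1.1

$$\mathrm{FOLLOW}_k(p, A) = \bigcup\{\mathrm{FIRST}_k(\beta) \oplus_k \mathrm{FOLLOW}_k(p', B) \mid [B \to \alpha \cdot A \beta] \in p \text{ and } p' \in \mathrm{PRED}(p, \alpha)\}$$

Lemma 3.1.2

$$\mathrm{LALR}_k(q, [A \to \omega\,\cdot]) = \bigcup\{\mathrm{FOLLOW}_k(p, A) \mid p \in \mathrm{PRED}(q, \omega)\}$$

Lemmas 3.1.1 and 3.1.2 follow directly from the two Lemmas 2.2.2 and 2.2.1, respectively,
of Kristensen and Madsen stated earlier.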
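
The truncated concatenation operator in Lemma 3.1.1 is not defined in this passage. The sketch below assumes the standard meaning: concatenate every pair of strings and keep the first k symbols. Strings are encoded as tuples of terminals, and the two input sets are made-up values rather than sets computed from a real grammar.

```python
def concat_k(left, right, k):
    """k-truncated concatenation of two sets of strings (tuples of terminals)."""
    return {(u + v)[:k] for u in left for v in right}

# Hypothetical inputs for k = 2: a FIRST_2 set holding the empty string, a short
# string and a full-length string, and a FOLLOW_2 set to complete the short ones.
FIRST_2 = {(), ("a",), ("b", "c")}
FOLLOW_2 = {("x", "y"), ("$",)}
COMBINED = concat_k(FIRST_2, FOLLOW_2, 2)
```

Strings of `FIRST_2` that already have length k pass through unchanged; only the shorter ones borrow symbols from the following context.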

Definition 3.1.2

$$\mathrm{READ}_k(p, X) = \{w \mid w \in \mathrm{FIRST}_k(\beta),\ |w| = k,\ [B \to \alpha \cdot X \beta] \in p\}$$

Definition 3.1.3

$$\mathrm{SHORT}_k(p, X) = \{[w, [B \to \alpha \cdot X \beta]] \mid w \in \mathrm{FIRST}_k(\beta),\ 0 < |w| < k,\ [B \to \alpha \cdot X \beta] \in p\}$$

Definition 3.1.4

$$\mathrm{READ}_{k'}(p, X) = \mathrm{READ}_k(p, X) \cup \{w \mid [w, [B \to \alpha \cdot X \beta]] \in \mathrm{SHORT}_k(p, X)\}$$

For a given state p and symbol X, $\mathrm{READ}_k(p, X)$ is the set of all strings of length
k that can be read following a transition on X in p; $\mathrm{SHORT}_k(p, X)$ is the set of pairs
$[w, [B \to \alpha \cdot X \beta]]$ where $[B \to \alpha \cdot X \beta]$ is an item in the state p and w is a non-empty string
of length less than k that is derivable from $\beta$; $\mathrm{READ}_{k'}(p, X)$ is simply the union of the
sets $\mathrm{FIRST}_k(\beta)$ for all items $[B \to \alpha \cdot X \beta]$ in p less the empty string $\varepsilon$. The set of strings
in $\mathrm{READ}_{k'}(p, X)$ will be referred to as strings of length k'. Combining Lemma 3.1.1 and
Definitions 2.2.3, 3.1.2 and 3.1.3, one obtains the following lemma:

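Lemma 3.1.3

$$\begin{aligned}
\mathrm{FOLLOW}_k(p, A) ={}& \mathrm{READ}_k(p, A) \\
&\cup \bigcup\{\mathrm{FOLLOW}_k(p', B) \mid (p, A) \text{ includes } (p', B)\} \\
&\cup \bigcup\{\{w\} \cdot \mathrm{FOLLOW}_{k-|w|}(p', B) \mid [w, [B \to \alpha \cdot A \beta]] \in \mathrm{SHORT}_k(p, A),\ p' \in \mathrm{PRED}(p, \alpha)\}.
\end{aligned}$$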

Lemma 3.1.3 breaks down the $\mathrm{FOLLOW}_k$ sets into three disjoint sets. The first set
is the set of strings of length k that can be read immediately after a transition in p on A.
The second set consists of strings of length k that are contributed by other $\mathrm{FOLLOW}_k$
sets when A can be followed by a nullable suffix in p. Finally, the third set consists of
strings formed by the concatenation of strings of length less than k that are derived from
a suffix following A in p with all possible suffixes of the right length that may follow these
short strings in the given context. Lemma 3.1.3 and Lemma 3.1.2 show that (just as in
the case of k = 1) the computation of $\mathrm{LALR}_k$ lookahead sets can be broken down into the
computation of smaller components: i.e., $\mathrm{LALR}_k$ sets are computed using $\mathrm{FOLLOW}_k$ sets
which, in turn, are computed using $\mathrm{READ}_k$ sets.

3.2      Computation of $\mathrm{READ}_k$, $\mathrm{FOLLOW}_k$ and $\mathrm{LALR}_k$ Sets

Let stack be a non-empty sequence of states representing a path in the LR(0) automaton
of a grammar G and let X be a grammar symbol in G, $X \ne \varepsilon$ and $X \ne S'$. A string
w is said to be readable in the context of (stack, X) if Xw can be successfully parsed
starting from the state p on top of stack and no reduction on some symbol in w ever
causes a stack underflow or consults the first element of stack. For the remainder of this
section, a pair (stack, $\alpha$), where stack is a sequence of states, will be loosely referred to
as a configuration. When the string $\alpha$ in a configuration consists of a single terminal, the
configuration is called a terminal configuration.

Since the LR(0) automaton of a context-free grammar G, $LRM_0$, is a correct parser for
G (although it may be nondeterministic), any string w that is readable in the context of a
configuration $([p_1, p_2, \ldots, p_n], X)$ can be parsed by executing the correct sequence of moves
of the automaton starting with the configuration $([p_1, p_2, \ldots, p_n, q], w)$, where q is the state
that is entered after a transition on X in $p_n$. In state q, there are three possibilities to
consider:

1. State q contains transitions on terminal symbols. Each terminal symbol a that is
directly readable in q is a possible starting symbol for one or more strings readable
in the context of (stack, X). The set of all suffixes of length (k - 1)' that can follow
a is computed, recursively, by considering the pair consisting of the state sequence
stack + [q] and the symbol a.

2. State q contains transitions on nullable nonterminals. After entering state q, the
parser can make a transition on a nullable nonterminal without consuming any input
symbol. All strings of length k' that are readable after such a transition must also
be included in the final result.

3. State p contains one or more items of the form $[C \to \gamma \cdot X]$. For each such item,
after making the transition on X into q, the parser can immediately reduce by the
rule $C \to \gamma X$. The reduce action consists of popping the elements of the stack
corresponding to $\gamma X$, making a transition on C and continuing the parse. Once
again, after executing the reduction, the parser still has not consumed any symbol.
Thus, every string of length k' that is readable in the new context must be considered.
The new context consists of the prefix of the stack obtained after the popping of $\gamma X$
and the symbol C.

From now on, let ts denote the state on top of stack; i.e., ts = stack(#stack). The
following equations formally capture the set of strings of length k' that can be read in the
context of (stack, X):

$$\mathrm{READ}_{0'}(stack, X) = \{\varepsilon\}$$

$$\begin{aligned}
\mathrm{READ}_{k'}(stack, X) ={}& \bigcup\{\{a\} \cdot \mathrm{READ}_{(k-1)'}(stack + [q], a) \mid a \in \mathrm{DR}(ts, X),\ q = \mathrm{GOTO}_0(ts, X)\} \\
&\cup \bigcup\{\mathrm{READ}_{k'}(stack + [q], Y) \mid (ts, X) \text{ reads } (q, Y)\} \\
&\cup \bigcup\{\mathrm{READ}_{k'}(stack(1..\#stack - |\gamma|), C) \mid [C \to \gamma \cdot X] \in ts,\ |\gamma| + 1 < \#stack\}.
\end{aligned}$$

The solution set of interest is the smallest set that satisfies the above equations. The set
$\mathrm{READ}_{k'}(p, X)$ is simply the set of strings of length k' that can be read in the context of the
configuration ([p], X). The equations above can be modified to compute $\mathrm{SHORT}_k(p, X)$
but as will be shown later this set can be computed as a side-effect in an incremental
algorithm for computing $\mathrm{READ}_k(p, X)$.
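
The equations can be transcribed almost literally into a recursive function. The sketch below is an illustration only: it returns strings of exactly length k, does no cycle detection, and runs on a hand-built LR(0) automaton for the hypothetical grammar S -> a S | b. The `Automaton` record and its four tables are assumptions introduced for this example.

```python
from typing import NamedTuple

class Automaton(NamedTuple):
    goto: dict         # (state, symbol) -> state
    dr: dict           # (state, symbol) -> terminals shiftable right after that transition
    reads: dict        # (state, symbol) -> nullable nonterminals Y with (state, symbol) reads (q, Y)
    final_items: dict  # (state, X) -> [(C, len(gamma))] for each item [C -> gamma . X] in state

def read_k(m, stack, x, k):
    """Strings of length k readable in the context (stack, x)."""
    if k == 0:
        return {()}
    ts = stack[-1]
    q = m.goto[(ts, x)]
    result = set()
    # Terminal transitions out of q: consume one symbol and recurse with k - 1.
    for a in m.dr.get((ts, x), ()):
        result |= {(a,) + tail for tail in read_k(m, stack + [q], a, k - 1)}
    # Transitions on nullable nonterminals out of q: nothing is consumed.
    for y in m.reads.get((ts, x), ()):
        result |= read_k(m, stack + [q], y, k)
    # Reductions by C -> gamma x, allowed only while the first stack element stays untouched.
    for c, gamma_len in m.final_items.get((ts, x), ()):
        if gamma_len + 1 < len(stack):
            result |= read_k(m, stack[:len(stack) - gamma_len], c, k)
    return result

# LR(0) automaton for S -> a S | b (state 0 is the start state):
#   0: S' -> .S, S -> .aS, S -> .b     2: S -> a.S, S -> .aS, S -> .b
#   1: S' -> S.                        3: S -> b.        4: S -> aS.
TOY = Automaton(
    goto={(0, "S"): 1, (0, "a"): 2, (0, "b"): 3, (2, "S"): 4, (2, "a"): 2, (2, "b"): 3},
    dr={(0, "a"): {"a", "b"}, (2, "a"): {"a", "b"}},
    reads={},
    final_items={(0, "b"): [("S", 0)], (2, "b"): [("S", 0)], (2, "S"): [("S", 1)]},
)
ONE = read_k(TOY, [0], "a", 1)
TWO = read_k(TOY, [0], "a", 2)
```

With k = 2, no string beginning with b is returned: after reading b the parser would have to reduce past the first stack element, so b is only a short string in this context.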

In practice, it is not always possible to simulate all possible steps of an LR(0) parser,
as suggested above, if the underlying grammar is ambiguous. More precisely, if the LR(0)
automaton contains one or more loops labeled with nullable nonterminals or the grammar
contains rules that can cause a derivation of the form $A \Rightarrow^{+}_{rm} A$, the simulation can get
into an infinite loop. These two problems are illustrated in the subgraphs of Figure 3.10.
In both cases, assume that $C \Rightarrow^{*} \varepsilon$. In Figure 3.10(a), once state p is entered, the parser
can, without consuming any input symbol, enter state q and traverse the loop q...r any
number of times before moving on to state s. Similarly, in Figure 3.10(b), if the parser
enters state p where it recognizes a B without consuming any input symbol, it can enter
state q, produce C, and reduce BC to B, which would bring it back to state p where
the same process can be repeated indefinitely.

An algorithm can be devised to keep track of all configurations already seen, while
simulating the parser, in order to avoid these two problems [24]. However, from a practical
point of view, since the main goal of computing READ_k is to construct an LALR(k)
parser, and since any occurrence of one of these two conditions renders a grammar not
LR(k) for any k, one can simply ensure that these two conditions do not occur in the
given grammar and its automaton before attempting to compute the READ_k sets. From
Figure 3.10: Subgraph of LR(0) automata with parsing cycles


function READ_k(stack, X, k);
    if k = 0 then
        return {ε};
    end if;
    rd := ∅;
    ts := stack(#stack);
    q := GOTO_0(ts, X);
    for a ∈ DR(ts, X) loop
        rd := rd ∪ {ax | x ∈ READ_k(stack + [q], a, k-1)};
    end loop;
    for (q, Y) | (ts, X) reads (q, Y) loop
        rd := rd ∪ READ_k(stack + [q], Y, k);
    end loop;
    for [C → γ · X] ∈ ts | C ≠ S' and |γ| + 1 < #stack loop
        rd := rd ∪ READ_k(stack(1..#stack - |γ|), C, k);
    end loop;
    return rd;
end READ_k;


Figure 3.11: Recursive READ_k function
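As a rough illustration, the recursive function of Figure 3.11 can be sketched in Python. The tables GOTO0, DR, READS and FINAL_ITEMS below are assumptions made up for this example: they hand-encode the LR(0) automaton of a hypothetical grammar S → a B c, B → b, and are not taken from the thesis. Lookahead strings are represented as tuples of terminals.

```python
# Hypothetical LR(0) automaton for the toy grammar  S -> a B c ,  B -> b.
# State 1 is the start state; all tables are illustrative assumptions.
GOTO0 = {(1, "S"): 2, (1, "a"): 3, (3, "B"): 4, (3, "b"): 5, (4, "c"): 6}

# DR(p, X): terminals that can be shifted in GOTO0(p, X).
DR = {(1, "a"): {"b"}, (3, "B"): {"c"}}

# (p, X) reads (q, Y): empty here because the toy grammar has no nullable nonterminal.
READS = {}

# Items [C -> gamma . X] in state p, stored as (C, |gamma|) under the key (p, X).
FINAL_ITEMS = {(3, "b"): [("B", 0)], (4, "c"): [("S", 2)]}


def read_k(stack, x, k):
    """Strings of length k readable in the context of the configuration (stack, x)."""
    if k == 0:
        return {()}
    ts = stack[-1]
    q = GOTO0[(ts, x)]
    rd = set()
    # Case 1: a terminal can be read directly after the transition on x.
    for a in DR.get((ts, x), ()):
        rd |= {(a,) + s for s in read_k(stack + (q,), a, k - 1)}
    # Case 2: a nullable nonterminal transition follows.
    for q2, y in READS.get((ts, x), ()):
        rd |= read_k(stack + (q2,), y, k)
    # Case 3: a reduction that stays strictly inside the current stack.
    for lhs, gamma_len in FINAL_ITEMS.get((ts, x), ()):
        if lhs != "S'" and gamma_len + 1 < len(stack):
            rd |= read_k(stack[: len(stack) - gamma_len], lhs, k)
    return rd


print(read_k((1,), "a", 2))
```

In this toy automaton, asking for k = 3 yields the empty set: the only remaining move is the reduction by S → a B c, which would uncover stack(1), so the configuration is blocked.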
function READ_STEP(stack, X);
    configs := ∅;
    ts := stack(#stack);
    q := GOTO_0(ts, X);
    for a ∈ DR(ts, X) loop
        configs := configs ∪ {[stack + [q], a]};
    end loop;
    for (q, Y) | (ts, X) reads (q, Y) loop
        configs := configs ∪ READ_STEP(stack + [q], Y);
    end loop;
    for [C → γ · X] ∈ ts | C ≠ S' and |γ| + 1 < #stack loop
        configs := configs ∪ READ_STEP(stack(1..#stack - |γ|), C);
    end loop;
    return configs;
end READ_STEP;

function READ_k(stack, X, k);
    if k = 0 then
        return {ε};
    end if;
    rd := ∅;
    for [stk, a] ∈ READ_STEP(stack, X) loop
        rd := rd ∪ {ax | x ∈ READ_k(stk, a, k-1)};
    end loop;
    return rd;
end READ_k;


Figure 3.12: Incremental READ_k function
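The incremental formulation of Figure 3.12 might look as follows in Python. This is only a sketch under stated assumptions: the tables encode a hypothetical grammar S → a N c d, N → ε (so that the reads relation is exercised), and the names GOTO0, DR, READS and FINAL_ITEMS are illustrative, not part of the thesis.

```python
# Hypothetical LR(0) automaton for  S -> a N c d ,  N -> (empty).
GOTO0 = {(1, "S"): 2, (1, "a"): 3, (3, "N"): 4, (4, "c"): 5, (5, "d"): 6}
DR = {(3, "N"): {"c"}, (4, "c"): {"d"}}          # terminals shiftable after the transition
READS = {(1, "a"): [(3, "N")]}                   # N is nullable and has a transition in state 3
FINAL_ITEMS = {(5, "d"): [("S", 3)]}             # [S -> a N c . d], |gamma| = 3


def read_step(stack, x):
    """One step: terminal configurations reachable after the transition on x."""
    ts = stack[-1]
    q = GOTO0[(ts, x)]
    configs = {(stack + (q,), a) for a in DR.get((ts, x), ())}
    for q2, y in READS.get((ts, x), ()):
        configs |= read_step(stack + (q2,), y)
    for lhs, gamma_len in FINAL_ITEMS.get((ts, x), ()):
        if lhs != "S'" and gamma_len + 1 < len(stack):
            configs |= read_step(stack[: len(stack) - gamma_len], lhs)
    return configs


def read_k(stack, x, k):
    """Extend strings one terminal at a time by repeated calls to read_step."""
    if k == 0:
        return {()}
    rd = set()
    for stk, a in read_step(stack, x):
        rd |= {(a,) + s for s in read_k(stk, a, k - 1)}
    return rd


print(read_step((1,), "a"))
print(read_k((1,), "a", 2))
```

Because each step returns the stacks together with the terminals, a caller can stop extending as soon as the strings collected so far are long enough, which is the property the later sections rely on.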


now on, it will be assumed that these conditions have been checked and that the grammar
in question contains no nonterminal that can rightmost produce itself and the LR(0)
automaton contains no nullable cycles. The algorithm of Figure 3.11 is a straightforward
implementation of the READ_k equations.
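The precondition that no nonterminal can produce itself might be checked along the following lines. This Python sketch is not the thesis's procedure: it uses the standard technique of computing the nullable nonterminals and then looking for a cycle in the relation "A → αBβ with α and β nullable". The grammar encoding (a dict from nonterminal to a list of right-hand-side tuples) is an assumption, and the second condition (nullable cycles in the automaton) is not covered here.

```python
def nullable_set(grammar):
    """Nonterminals that derive the empty string (simple fixed-point iteration)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            if lhs not in nullable and any(all(s in nullable for s in rhs) for rhs in rhss):
                nullable.add(lhs)
                changed = True
    return nullable


def self_deriving(grammar):
    """Nonterminals A such that A derives A in one or more steps."""
    nullable = nullable_set(grammar)
    # Edge A -> B when a rule A -> alpha B beta has alpha and beta both nullable.
    edges = {a: set() for a in grammar}
    for lhs, rhss in grammar.items():
        for rhs in rhss:
            for i, sym in enumerate(rhs):
                rest_nullable = all(s in nullable for j, s in enumerate(rhs) if j != i)
                if sym in grammar and rest_nullable:
                    edges[lhs].add(sym)
    result = set()
    for a in grammar:
        seen, todo = set(), list(edges[a])
        while todo:                      # depth-first search from the successors of a
            b = todo.pop()
            if b not in seen:
                seen.add(b)
                todo.extend(edges[b])
        if a in seen:                    # a is reachable from itself: a cycle exists
            result.add(a)
    return result


# Mirrors the situation of Figure 3.10(b): B -> B C with C nullable.
bad = {"B": [("B", "C"), ("b",)], "C": [()]}
good = {"S": [("A", "b")], "A": [("a",)]}
print(self_deriving(bad), self_deriving(good))
```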

The algorithm of Figure 3.12 is an incremental version of the algorithm of Figure 3.11.
Each incremental "step" is performed by a function READ_STEP which, given a configuration
(stack, X), yields a set of new terminal configurations that can be reached after the
parser has executed the transition on X in the context of stack. The READ_k set for a
given configuration (stack, X) can be computed by making successive calls to READ_STEP
to extend READ_k strings until they reach the proper length or cannot be extended. This
incremental approach has the added advantage that it can be used to simultaneously compute
the set SHORT_k(p, X). When invoked with a pair ([p], X), if the incremental READ_k
function of Figure 3.12 reaches a configuration (stack, Y) that is blocked because its next
action is a reduction that would cause it to use stack(1) or cause a stack underflow, this
indicates that state p contains one or more items of the form [C → α · Xβγ] where the
suffix βγ generates at least one string whose length is shorter than k. (Note that βγ may
function FOLLOW_STEP(stack, X);
    ts := stack(#stack);
    if #stack = 1 then
        if [ts, X] ∈ VISITED then
            return ∅;
        end if;
        VISITED := VISITED ∪ {[ts, X]};
    end if;
    configs := ∅;
    q := GOTO_0(ts, X);
    for a ∈ DR(ts, X) loop
        configs := configs ∪ {[stack + [q], a]};
    end loop;
    for (q, Y) | (ts, X) reads (q, Y) loop
        configs := configs ∪ FOLLOW_STEP(stack + [q], Y);
    end loop;
    for [C → γ · X] ∈ ts | C ≠ S' loop
        if |γ| + 1 < #stack then
            configs := configs ∪ FOLLOW_STEP(stack(1..#stack - |γ|), C);
        else
            assert γ = γ1γ2 where |γ2| + 1 = #stack;
            for q ∈ PRED(stack(1), γ1) loop
                configs := configs ∪ FOLLOW_STEP([q], C);
            end loop;
        end if;
    end loop;
    return configs;
end FOLLOW_STEP;


Figure 3.13: FOLLOW_k function


actually be ε.) In such a case, stack = [p, p1, ..., pn], where |Xβ| = n and the sequence of
states p1, ..., pn are the states traversed by the automaton when executing GOTO_0(p, Xβ).
Thus, knowing the relevant items [C → αXβ · γ] ∈ pn that cannot be extended, the source
items [C → α · Xβγ] can be obtained by simply moving the "dot" back n symbols.

From Lemma 3.1.3, one observes that the computation of FOLLOW_k(p, A) is based
on the sets READ_k(p, A) and SHORT_k(p, A). Each element w ∈ READ_k(p, A) is added
directly to FOLLOW_k(p, A) and each element of SHORT_k(p, A) is extended into a set of
strings of length k which are then added to FOLLOW_k(p, A). To extend a short string
w ∈ SHORT_k(p, A), the items whose suffixes produced w must be identified and each string
of length k − |w| that can follow the left-hand side of such items, in the context of the state
p, is appended to w. Unfortunately, these equations cannot be used in a straightforward
manner to derive an algorithm to compute FOLLOW_k sets. Once again, one has to avoid
getting into an infinite loop. Just as in the case of k = 1, this may occur when the includes
relation contains one or more cycles. However, one cannot use the digraph algorithm as
was done in the case of FOLLOW_1 sets because the equations of Lemma 3.1.3 are not of
the suitable form for that algorithm. Hence, a recursive algorithm must be used and this
algorithm must keep track of all state-nonterminal pairs that are visited.

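Combining the READ_k sets above with the equation of Lemma 3.1.3, one obtains
the following extended equations for FOLLOW_k sets:

Lemma 3.2.1

\[
\begin{aligned}
\mathrm{FOLLOW}_0(\mathit{stack}, X) ={}& \{\varepsilon\} \\
\mathrm{FOLLOW}_k(\mathit{stack}, X) ={}& \bigcup \{\, \{a\}.\mathrm{FOLLOW}_{k-1}(\mathit{stack} + [q],\, a) \mid a \in \mathrm{DR}(ts, X),\ q = \mathrm{GOTO}_0(ts, X) \,\} \\
&\cup \bigcup \{\, \mathrm{FOLLOW}_k(\mathit{stack} + [q],\, Y) \mid (ts, X)\ \mathit{reads}\ (q, Y) \,\} \\
&\cup \bigcup \{\, \mathrm{FOLLOW}_k(\mathit{stack}(1..\#\mathit{stack} - |\gamma|),\, C) \mid [C \to \gamma \cdot X] \in ts,\ |\gamma| + 1 < \#\mathit{stack} \,\} \\
&\cup \bigcup \{\, \mathrm{FOLLOW}_k([q],\, C) \mid [C \to \gamma_1\gamma_2 \cdot X] \in ts,\ |\gamma_2| + 1 = \#\mathit{stack},\ q \in \mathrm{PRED}(\mathit{stack}(1), \gamma_1) \,\}
\end{aligned}
\]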

These equations yield the set of strings of length k that can follow
in the context of the configuration (stack, X). The configuration ([p], A) is used to
initiate the computation of FOLLOW_k(p, A) using the above equations. The function
FOLLOW_STEP of Figure 3.13 computes a step for a FOLLOW_k set, based on the equations
of Lemma 3.2.1, in the same way as the READ_STEP function of Figure 3.12 computes
a step for a READ_k set. FOLLOW_STEP differs from READ_STEP in that, given a
pair (stack, X) as input, it computes all succeeding terminal configurations of that pair,
even those that are outside the context. In other words, FOLLOW_STEP may automatically
trigger the computation of other FOLLOW_i sets, where 0 < i ≤ k. Prior to any
invocation of the FOLLOW_STEP function, a global set VISITED must be initialized to
the empty set. VISITED is subsequently used in FOLLOW_STEP to keep track of all
state-nonterminal pairs that have been seen. A FOLLOW_k function can be designed that uses
the FOLLOW_STEP function to compute FOLLOW_k sets just like the READ_k function
of Figure 3.12 computes READ_k sets using the READ_STEP function.
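One possible shape for FOLLOW_STEP and a FOLLOW_k driver built on it is sketched below in Python. Everything concrete here is an assumption for illustration: the tables encode a hypothetical grammar S' → S $, S → a A, A → b; pred is a naive reverse walk over GOTO0; and a fresh VISITED set is created before each step, which is one reading of the initialization requirement described above.

```python
# Hypothetical LR(0) automaton for  S' -> S $ ,  S -> a A ,  A -> b.
GOTO0 = {(1, "S"): 2, (1, "a"): 3, (3, "A"): 4, (3, "b"): 5, (2, "$"): 6}
DR = {(1, "S"): {"$"}}
READS = {}
# Items [C -> gamma . X] in state p, stored as (C, gamma) under the key (p, X).
FINAL_ITEMS = {(3, "A"): [("S", ("a",))], (3, "b"): [("A", ())]}


def pred(state, gamma):
    """States from which `state` is reached by reading gamma."""
    states = {state}
    for sym in reversed(gamma):
        states = {p for (p, s), q in GOTO0.items() if s == sym and q in states}
    return states


def follow_step(stack, x, visited):
    ts = stack[-1]
    if len(stack) == 1:
        if (ts, x) in visited:            # state-nonterminal pair already explored
            return set()
        visited.add((ts, x))
    q = GOTO0[(ts, x)]
    configs = {(stack + (q,), a) for a in DR.get((ts, x), ())}
    for q2, y in READS.get((ts, x), ()):
        configs |= follow_step(stack + (q2,), y, visited)
    for lhs, gamma in FINAL_ITEMS.get((ts, x), ()):
        if lhs == "S'":
            continue
        if len(gamma) + 1 < len(stack):   # reduction stays inside the stack
            configs |= follow_step(stack[: len(stack) - len(gamma)], lhs, visited)
        else:                             # leave the context: gamma = gamma1 gamma2
            gamma1 = gamma[: len(gamma) - (len(stack) - 1)]
            for p in pred(stack[0], gamma1):
                configs |= follow_step((p,), lhs, visited)
    return configs


def follow_k(stack, x, k):
    if k == 0:
        return {()}
    out = set()
    for stk, a in follow_step(stack, x, set()):
        out |= {(a,) + s for s in follow_k(stk, a, k - 1)}
    return out


print(follow_k((3,), "A", 1))
```

Starting from ([3], A), the reduction by S → a A leaves the initial context, so the step continues from the predecessor state 1 on S, where the end marker can be read.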

From Lemma 3.1.2, one observes that full LALR_k sets for a given state-item pair can
be easily computed given the relevant FOLLOW_k sets. However, the computation of full
LALR_k sets (or full FOLLOW_k sets for that matter) is not of primary interest. In the
next section, the incremental computation of minimal LALR_k lookahead sets for a given
state is explored.

3.3     Computation  of  Varying-Length  Lookahead  Strings 

As described earlier, conflicts are resolved one at a time, in each inconsistent state, for
each conflict symbol. Recall that an LR(0) state is inconsistent if it consists of two or more
items and at least one of these items is a final item. In such a case, one cannot determine
which action to execute without knowing which input symbol or symbols can be expected.
Thus, this initial inconsistency of a state q can be viewed as a conflict on ε. An attempt is made to
resolve it by computing the LALR(1) lookahead sets that can appear after each final
item. In other words, without consuming any input symbol, these final items can cause a
reduction that takes the automaton into other states where these lookahead symbols can
be read. Since GOTO_0(q, ε) = q, state q can also be reached without any consumption of
input symbols. The actions that can be executed without consuming any input symbol
will be referred to as ε-actions.



A state q is said to be LALR(1) if the intersection of its LALR(1) lookahead sets and
the set of terminals on which shift actions are defined in it is empty. If a state q is not
LALR(1), each symbol a in this intersection is an LALR(1) conflict symbol that must be
extended into a set of lookahead strings of length 2 in an attempt to disambiguate q. If
conflict symbols are detected in the sets of symbols that were appended to a to extend
it into a set of lookahead strings of length 2, the process is repeated for these LALR(2)
conflict symbols, then for LALR(3) conflict symbols if any, and so on, up to k − 1.

In order to extend an LALR(1) conflict symbol a in a state q into a longer lookahead
string, the relevant actions with which a is associated must be identified. This will include
one or more reduce actions and perhaps a shift action. Once the relevant actions have been
identified, one must find all the sources where the symbol a can be read for each action.
For a shift action, the source is the state q itself. For a reduce action, the sources can be
found by retracing the paths of the automaton used to compute the LALR(1) lookahead
sets, looking for the relevant states where the symbol a can be read. It is not hard to see
that the set of sources for the reduce action of a final item [A → ω ·] in a state q is the
following set:

stacks = {stack | [stack, a] ∈ FOLLOW_STEP([p], A), p ∈ PRED(q, ω)}.

Knowing the set of sources stacks of a symbol a that is associated with a given item,
the symbol can be extended into a set of lookahead strings of length 2 as follows:

{ab | [stack, b] ∈ FOLLOW_STEP(stk, a), stk ∈ stacks}.

Furthermore, note that as the FOLLOW_STEP function is extending the lookahead,
it is also computing the next set of sources. Hence, the FOLLOW_STEP function can
be adapted to help compute the minimal LALR(k) lookahead sets. However, one can do
better by splitting this function into two special-purpose functions: FOLLOW_SOURCES
(Figure 3.14), which, given a configuration (stack, X) and a terminal a, returns the set of all
possible sources of a in the context of (stack, X); and NEXT_LA (Figure 3.15),
which, given a configuration (stack, X), returns the set of terminals that can follow in the
context of (stack, X). Note that in NEXT_LA, instead of retracing the
reads and includes paths each time, the READ_1 and FOLLOW_1 sets are used since they
are already available for the computation of LALR(1) lookahead sets.

With these two functions as building blocks, the LOOK_AHEAD function of Figure 3.16
computes the minimal LALR(k) lookahead sets with varying-length strings in a straightforward
manner. LOOK_AHEAD takes as argument an inconsistent state q. Its first step
is to initialize the ACTION map for each pair (q, a), a ∈ T, into the singleton {shift p}
if GOTO_0(q, a) = p, or into the empty set otherwise. (Initially, ACTION = ∅.) Next,
LOOK_AHEAD computes the LALR(1) lookahead set for each final item in q and uses the
symbols in these lookahead sets to update the ACTION map with the relevant reduce actions.
After having completed this process, if q is an LALR(1) state, then #ACTION(q, a) ≤ 1,
∀a ∈ T. Otherwise, each terminal that does not satisfy this condition is an LALR(1)
conflict symbol. For each such symbol a, the next step of LOOK_AHEAD is to compute
the sources of a for each action with which a is associated and invoke the procedure
RESOLVE_CONFLICTS of Figure 3.17 to extend a into an appropriate set of
non-conflicting lookahead strings.
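The first two steps of LOOK_AHEAD, building the ACTION map of one state and picking out its LALR(1) conflict symbols, can be sketched as follows. The function names, the tuple encoding of actions and the state number 7 are assumptions made for this Python example; the lookahead sets are supplied by hand rather than computed.

```python
def build_actions(shifts, lookaheads):
    """ACTION map of one state.

    shifts:      terminal -> target state of the shift
    lookaheads:  rule (as a string) -> set of LALR(1) lookahead terminals
    """
    action = {a: {("shift", p)} for a, p in shifts.items()}
    for rule, la in lookaheads.items():
        for a in la:
            action.setdefault(a, set()).add(("reduce", rule))
    return action


def conflict_symbols(action):
    """Terminals on which more than one action is defined."""
    return sorted(a for a, acts in action.items() if len(acts) > 1)


# State reached after CASE in a variant part: IDENTIFIER can be shifted
# (as a field identifier) or can trigger the reduction of the empty tag field.
action = build_actions(
    shifts={"IDENTIFIER": 7},
    lookaheads={"tag_field -> (empty)": {"IDENTIFIER"}},
)
print(conflict_symbols(action))
```

Each symbol reported by conflict_symbols would then be handed, together with its sources, to the conflict resolution procedure.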
function FOLLOW_SOURCES(stack, X, a);
    ts := stack(#stack);
    if #stack = 1 then
        if [ts, X] ∈ VISITED then
            return ∅;
        end if;
        VISITED := VISITED ∪ {[ts, X]};
    end if;
    q := GOTO_0(ts, X);
    if a ∈ DR(ts, X) then
        stacks := {stack + [q]};
    else
        stacks := ∅;
    end if;
    for (q, Y) | (ts, X) reads (q, Y) loop
        stacks := stacks ∪ FOLLOW_SOURCES(stack + [q], Y, a);
    end loop;
    for [C → γ · X] ∈ ts | C ≠ S' loop
        if |γ| + 1 < #stack then
            stacks := stacks ∪ FOLLOW_SOURCES(stack(1..#stack - |γ|), C, a);
        else
            assert γ = γ1γ2 where |γ2| + 1 = #stack;
            for q ∈ PRED(stack(1), γ1) loop
                stacks := stacks ∪ FOLLOW_SOURCES([q], C, a);
            end loop;
        end if;
    end loop;
    return stacks;
end FOLLOW_SOURCES;


Figure 3.14: FOLLOW_SOURCES function


RESOLVE_CONFLICTS is a recursive procedure that is invoked with 4 arguments: an
inconsistent state q; a conflict symbol t; a mapping sources that takes each action act in
the set ACTION(q, t) into the set of sources where t can be read following a sequence of
ε-actions induced by act in q; and an integer n that indicates that t is an LALR(n − 1) conflict
symbol. RESOLVE_CONFLICTS first checks whether or not n > k. If so, it returns
immediately. Otherwise, it allocates a lookahead state p, initializes ACTION(p, a) = ∅,
∀a ∈ T, and resets ACTION(q, t) to a single lookahead shift action to the new state p.
This is because when in state q, if the next input symbol is t, the parser needs to perform
more lookahead actions in order to determine its next move. Next, a new set of lookahead
symbols is computed for each action and for each new lookahead symbol a, ACTION(p, a)
is updated accordingly. If no conflicts are detected in the new lookahead sets, the original
LR(0) state that started this process is LALR(n). Otherwise, for each LALR(n) conflict
symbol, new sources are computed and the process is repeated with a recursive invocation
of RESOLVE_CONFLICTS.

function NEXT_LA(stack, X);
    ts := stack(#stack);
    q := GOTO_0(ts, X);
    la := READ_1(ts, X);
    for [C → γ · X] ∈ ts | C ≠ S' loop
        if |γ| + 1 < #stack then
            la := la ∪ NEXT_LA(stack(1..#stack - |γ|), C);
        else
            assert γ = γ1γ2 where |γ2| + 1 = #stack;
            for q ∈ PRED(stack(1), γ1) loop
                la := la ∪ FOLLOW_1(q, C);
            end loop;
        end if;
    end loop;
    return la;
end NEXT_LA;


Figure 3.15: NEXT_LA function


After the LOOK_AHEAD function has been invoked to resolve conflicts for each inconsistent
LR(0) state, if k > 0, the grammar is LALR(k) if and only if the following three
conditions are satisfied:

1. the reads relation contains no cycle;

2. ∀A ∈ N, A ⇏+ A;

3. for each LR(0) or lookahead state p and terminal symbol a, #ACTION(p, a) ≤ 1.

If either condition 1 or 2 is not satisfied, the grammar is not LR(k) for any k. If only
condition 3 is not satisfied, it may still be possible to construct a deterministic LALR
parser for the grammar by increasing the value of k. However, it may also be that the
grammar is ambiguous or that it is an LR(k) grammar that is not LALR(k) for any value
of k. In those cases, no amount of extra lookahead will help.

3.4      Remarks 

In practice, this incremental approach for constructing LALR(k) parsers is very efficient
because most constructs in an LALR(k) grammar are in fact LR(0) or LALR(1). Moreover,
when it is necessary to use more lookahead, an LALR(k) parser with varying-length
lookahead strings is often as space-efficient as an LALR(1) parser that is obtained by
transforming an LALR(k) input grammar. This is because duplication of some constructs is
usually necessary to render an LALR(k) grammar LALR(1). Thus, the resulting LALR(1)
parsing tables may contain more actions than their LALR(k) counterpart for the original
grammar. For example, consider the Pascal grammar in Appendix F. That grammar is
LALR(2) because of the following constructs:
procedure LOOK_AHEAD(q);
    for a ∈ T loop
        ACTION(q, a) := if (p := GOTO_0(q, a)) ≠ Ω then {shift p} else ∅ end;
    end loop;

    for each final item A → ω · ∈ q loop
        for a ∈ LA(q, A → ω ·) loop
            ACTION(q, a) := ACTION(q, a) ∪ {reduce A → ω};
        end loop;
    end loop;

    for a ∈ T | #ACTION(q, a) > 1 loop
        sources := ∅;
        for act ∈ ACTION(q, a) loop
            if act is a shift action then
                sources(act) := {[q]};
            else
                assert act = reduce A → ω;
                sources(act) := ∅;
                for p ∈ PRED(q, ω) loop
                    VISITED := ∅;
                    sources(act) := sources(act) ∪ FOLLOW_SOURCES([p], A, a);
                end loop;
            end if;
        end loop;
        RESOLVE_CONFLICTS(q, a, sources, 2);
    end loop;
    return;
end LOOK_AHEAD;


Figure 3.16: Procedure to compute lookahead sets with variable-length strings


if_statement     →  IF expression THEN statement
                 |  IF expression THEN restricted_statement [;] ELSE statement

variant_part     →  CASE tag_field type_identifier OF variant_list

tag_field        →  ε
                 |  field_identifier :

field_identifier →  IDENTIFIER

type_identifier  →  IDENTIFIER

procedure RESOLVE_CONFLICTS(q, t, sources, n);
    if n > k then
        return;
    end if;

    allocate a new state p;
    for a ∈ T loop
        ACTION(p, a) := ∅;
    end loop;

    ACTION(q, t) := {la-shift p};

    for stacks = sources(act) loop    -- for each pair [act, stacks] in the sources map
        for stk ∈ stacks loop
            for a ∈ NEXT_LA(stk, t) loop
                ACTION(p, a) := ACTION(p, a) ∪ {act};
            end loop;
        end loop;
    end loop;

    for a ∈ T | #ACTION(p, a) > 1 loop
        new_sources := ∅;
        for act ∈ ACTION(p, a) loop
            new_sources(act) := ∅;
            for stk ∈ sources(act) loop
                VISITED := ∅;
                new_sources(act) := new_sources(act) ∪ FOLLOW_SOURCES(stk, t, a);
            end loop;
        end loop;
        RESOLVE_CONFLICTS(p, a, new_sources, n+1);
    end loop;
    return;
end RESOLVE_CONFLICTS;


Figure 3.17: RESOLVE_CONFLICTS procedure


An optional ";" was added before ELSE in the if_statement rule above. This is an extension
of the syntax of Pascal. Since Pascal programmers commonly make the mistake of
inserting an extraneous semicolon in front of ELSE, it is preferable to have the parser be
able to accept such an input and emit a "warning" message for it instead of an error.
Unfortunately, when ";" is used in this context, the parser does not know whether to reduce the
handle to statement or restricted_statement. If, on the other hand, the parser can consult
two symbols, the string "; ELSE" instructs it to reduce the handle to restricted_statement.
This LALR(2) conflict cannot easily be removed from the grammar without some major
restructuring of the rules. (But since it is an extension, it will not be discussed any further.)
In the case of a variant_part, after shifting the symbol CASE, if the parser can only consult
one input symbol and that symbol is IDENTIFIER, the parser does not know whether
to shift it, because it is a field_identifier, or to reduce tag_field by the empty rule, because
it is a type_identifier. This conflict can be resolved by duplicating the variant_part rule and
removing the ε choice for tag_field as follows:

variant_part  →  CASE type_identifier OF variant_list
              |  CASE tag_field type_identifier OF variant_list

tag_field     →  field_identifier :
Table D.1 in Appendix D gives some information about different LALR parsers constructed
with the method described in this thesis. In that table, Pascal2 refers to the
LALR(2) grammar of Appendix F; Pascal1 is the same grammar as Pascal2, but without
the optional semicolon preceding ELSE in the if_statement; Pascal is the same grammar as
Pascal1, but with the variant_part and tag_field rules modified as suggested above. Note
that the parsing tables for Pascal1 contain fewer actions and are slightly smaller than the
parsing tables for Pascal.
Chapter 4

LALR(k) Parser and Error Recovery

4.1      Overview 

4.1.1  The  Parsing  Framework 

An LR parsing configuration has two components: a state stack and the remaining input
tokens. This method assumes a framework in which the parser maintains a state stack,
denoted state_stack, and an input buffer containing a fixed number of input symbols. These
symbols include the current token or lookahead, denoted curtok, and the token immediately
preceding the current token (last token processed), denoted prevtok. The remaining tokens
in the buffer are input tokens following curtok. A number of attributes are associated with
each input symbol such as its class, its location within the input source, its character string
representation, etc. An input symbol together with all its attributes is referred to as a
token element. Each state q in the state stack is also associated with certain attributes
including the grammar symbol that caused the transition into q (called the in_symbol of
q), and the location of the first input token on which an action was executed in q.
An  LR  parsing  configuration  may  be  represented  by  a  string  of  the  form: 

q1 q2 ... qm  |  t1 t2 ... tn

The sequence to the left of the vertical bar is the content of the state stack, with qm at the
top; q1 ... qm is a valid sequence of states in the LR parsing machine corresponding to a
viable prefix. The sequence to the right of the vertical bar is the unexpended input. Each
element ti represents the class of a corresponding input symbol. The symbol t1 represents
the class of the current token, t2 represents the class of the successor of the current token,
etc. The symbol t0, which is not shown above, represents the class of the previous token.
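A configuration of this kind could be modeled roughly as below. This is a minimal Python sketch, not the thesis's data structure: the class names Token and Configuration, the field names, and the convention that buffer[0] holds the previous token are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """A token element: an input symbol together with its attributes."""
    kind: str      # class of the symbol, e.g. "IDENTIFIER"
    text: str      # character string representation
    line: int      # location within the input source
    column: int


@dataclass
class Configuration:
    state_stack: List[int]   # q1 ... qm, top of stack last
    buffer: List[Token]      # assumed layout: [prevtok, curtok, following tokens...]

    @property
    def prevtok(self) -> Token:
        return self.buffer[0]

    @property
    def curtok(self) -> Token:
        return self.buffer[1]

    def show(self) -> str:
        """Render as 'q1 ... qm | t1 ... tn' (the previous token t0 is not shown)."""
        states = " ".join(f"q{s}" for s in self.state_stack)
        classes = " ".join(t.kind for t in self.buffer[1:])
        return f"{states} | {classes}"


cfg = Configuration(
    state_stack=[1, 4, 9],
    buffer=[
        Token("IF", "if", 1, 1),
        Token("IDENTIFIER", "x", 1, 4),
        Token("THEN", "then", 1, 6),
    ],
)
print(cfg.show())
```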

For simplicity, it will be assumed that the grammar used to construct the parser is
LR(1), but this method is applicable to all forms of LR(k) parsers.

4.1.2  Error  Recovery 

A parsing configuration in which no legal action is possible is called an error configuration.
When an error configuration is reached, the error recovery procedure is invoked. Its role is
to adjust the configuration so as to allow the parser to advance a minimum predetermined
distance in the input stream, usually two or three tokens past the repair point. The token
on which the error is detected is referred to as the error token and the state in which the
error is detected is called the error state.

Three kinds of recovery strategies are used. They are:

•  Simple recovery. A simple recovery (also called first level recovery [21]) is a single
symbol modification of the source text; i.e., the insertion of a single symbol into
the input stream, the deletion of an input token, the substitution of a grammar
symbol for an input token or the merging of two adjacent tokens to form a single
one. Previous authors [21][37] have used a more restricted form of simple recovery
involving only terminal symbols as repair candidates.

•  Phrase-level recovery. In phrase-level recovery, the error procedure tries to recover
by either deleting as small a sequence of tokens as possible in the vicinity of the error
token or replacing such a sequence with a nonterminal symbol. This technique can
be viewed as an automatic generalization of the error productions method described
in [9].

•  Scope recovery. A scope is a syntactically nested structure such as a parenthesized
expression, a block or a procedure. In scope recovery, the strategy is to recover by
inserting relevant symbols into the text to complete the construction of incomplete
scopes.
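To make the four kinds of simple recovery concrete, the following Python sketch enumerates the single-symbol modifications around an error token. It is an illustration only: the function name, the representation of tokens as plain strings, and the keyword-set test used for merging are assumptions, and the step of checking which candidates actually let the parser advance is left out.

```python
def simple_repairs(tokens, i, candidates, keywords):
    """Enumerate single-symbol modifications at error position i.

    tokens:     token texts of the input
    candidates: grammar symbols to try inserting or substituting
    keywords:   texts that two merged adjacent tokens may form
    """
    repairs = []
    for s in candidates:
        repairs.append(("insert", tokens[:i] + [s] + tokens[i:]))
        repairs.append(("substitute", tokens[:i] + [s] + tokens[i + 1:]))
    repairs.append(("delete", tokens[:i] + tokens[i + 1:]))
    # Merge the error token with its predecessor or with its successor.
    for j in (i - 1, i):
        if 0 <= j and j + 1 < len(tokens):
            merged = tokens[j] + tokens[j + 1]
            if merged in keywords:
                repairs.append(("merge", tokens[:j] + [merged] + tokens[j + 2:]))
    return repairs


# "go to 1 ;" with the error detected on "to"
repairs = simple_repairs(["go", "to", "1", ";"], 1, candidates=[";"], keywords={"goto"})
for kind, result in repairs:
    print(kind, result)
```

A recovery procedure would rank such candidates by how far the parser can advance after each one; that ranking is not shown here.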

This error recovery scheme consists of two phases called Primary phase and Secondary
phase. In the Primary phase, an attempt is made to recover with minimal modification of
the remaining input stream. Repairs that are attempted in the primary phase include the
different kinds of simple recoveries, scope recoveries that require the deletion of no or one
input symbol and phrase-level recoveries that do not require any deletion of input symbols.
Figure 4.1 shows some examples of primary phase recoveries. In the Secondary phase, more
radical approaches involving removal of some left context (state stack) information as well
as multiple deletion of tokens from the remaining input (right context) are attempted.
Figure 4.2 shows some examples of secondary phase recoveries.

4.1.3     Error  Detection 

A canonical LR(k) parser has the capability of detecting an error at the earliest possible
point. However, as mentioned earlier, canonical LR(k) parsers are seldom used because
of their size. Instead, variants such as LALR(k) and SLR(k) (usually k = 1) are used.
These LR variants, in part, solve the space problem by always using the underlying LR(0)
automaton. However, certain states in these parsers usually contain reduce actions that
may be illegal, depending on the actual context. Illegal reduce actions do not cause the
resulting parser to accept illegal inputs, but they prevent it from always detecting errors
at the earliest possible point. This problem is usually compounded by the use of the
space-saving technique known as default reduction described in the previous chapter. Another
undesirable side effect of default reductions is that when they are used, it is no longer
possible to compute, from the parsing table, the set of terminal symbols on which valid



1.  program TEST(INPUT, OUTPUT);

2.  var X,Y: array [] of integer;

***Error: index_type_list expected after this token

3.  begin

4.  1:  X := y.

***Error: ; expected instead of this token

5.  if X == b then begin

***Error: Unexpected symbol ignored

6.  go to 1;
    < >

***Error: Symbols merged to form GOTO

7.  a := ((b + c)

***Error: ")" inserted to complete phrase

***Error: "END" inserted to complete phrase started at line 5, column 21

8.  end.


Figure 4.1: Primary phase recoveries


1.  program P(INPUT, OUTPUT);

2.  procedure FACTORIAL(X:INTEGER):integer;
    < >

***Error: Unexpected input discarded

3.  begin

4.  end;

5.  begin

6.  if count[listdata[sub] := 0 then

***Error: "]" inserted to complete phrase
***Error: invalid relational_operator

7.  a := ((b + c ]];
    <>

***Error: ")" inserted to complete phrase
***Error: ")" inserted to complete phrase
***Error: Unexpected input discarded

8.  end.


Figure 4.2: Secondary phase recoveries
actions are defined in a given state. The inability to detect errors as soon as possible and
to obtain a set of viable terminal candidates for a given state is very problematic for error
recovery.

Furthermore, even with a canonical LR(k) parser, the ability to detect an error at the
earliest possible point only guarantees that the prefix parsed up to that point is correct.
Therefore, it is possible that the token on which an error is detected is not the one that is
actually in error. Consider the following Pascal declaration:

FUNCTION F(X:TINY, Y:BIG, Z:REAL);

In this example, it is very difficult to deduce the actual intention of the programmer,
but a simple substitution of the keyword "PROCEDURE" for the keyword "FUNCTION" would
solve the problem. However, the error is not detected until the semicolon (;) is encountered,
15 tokens later.

In [37], Burke and Fisher introduced a deferred parsing technique where two parsers
are run concurrently: one that parses normally and another that is kept a fixed distance
(measured in terminal symbols) behind it. When an error is encountered, error recovery is
attempted at all points between the two parsers. This approach avoids the premature
reductions problem and solves, in part, the problem of late detection of errors. However,
the overhead of the two parsers penalizes correct programs.

In this method, a new LR driver routine called the deferred driver is introduced. This
new driver can effectively detect an error at the earliest possible point even if the parser
contains default reductions. It can also be adapted to defer parsing actions on a fixed
number of tokens with negligible slow-down on correct programs. To achieve this goal, an
additional state stack is required for each deferred symbol. Thus, in practice, one must
restrict the number of symbols on which actions are deferred.

The method also relies on having two statically constructed mappings, t_symbols and
nt_symbols, which yield, for each state, a subset of the terminal and nonterminal symbols,
respectively, on which an action is defined in the state in question. These subsets are the
smallest subsets of viable error recovery candidates for each state. Their computation and
optimization will be discussed in chapter 5.

The remainder of this chapter is organized as follows:

•  detailed description of the new driver

•  presentation of various recovery techniques

•  discussion of how to apply these recovery techniques and how to issue accurate
diagnostics.

4.2     The  Driver 

An important improvement that can be made to an LR(k) automaton is the removal of
LR(0) reduce states. An LR(0) reduce state is a state that consists of a single final item.
Therefore, such a state contains only reduce actions by the rule from which the final item
in question is derived. If a representation of the parsing tables with default actions is used,
the parser will never consult the lookahead symbol when it is in one of these states. Thus,
such states may be completely removed from the parser by introducing a new parsing
action: read-reduce, which comprises a read transition followed by a reduction. If q is
an LR(0) reduce state consisting of a single final item $[A \to \alpha X \cdot]$, then, in order
to remove q from the automaton, the parsing tables can be modified as follows. For all states
p such that $\mathrm{GOTO}_0(p, X) = q$, change the action in p on X to a read-reduce action by
the rule $A \to \alpha X$. State q is now inaccessible and can be discarded. A read-reduce
action is referred to as a shift-reduce action when the symbol X in question is a terminal
symbol and as a goto-reduce action when X is a nonterminal.

When the parser encounters a shift-reduce action by a rule $A \to \alpha t_1$ in a configuration
$q_1, q_2, \ldots, q_m \mid t_1, t_2, \ldots, t_n$, it pops $|\alpha|$ elements from the state sequence,
reads the input symbol $t_1$ and executes the appropriate nonterminal action on the state
$q_{m-|\alpha|}$ and A. A goto-reduce action is processed in a similar fashion except that no
input symbol is read in. From now on, it is assumed that all LR(0) reduce states have been
removed from a given parsing table and replaced with read-reduce actions.
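The table rewrite described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the function name `remove_lr0_reduce_states`, the dictionary encoding of the automaton, and the tuple encoding of actions are all hypothetical.

```python
def remove_lr0_reduce_states(transitions, lr0_reduce):
    """Replace every transition into an LR(0) reduce state by a read-reduce action.

    transitions: dict mapping (state, symbol) -> target state (assumed encoding).
    lr0_reduce:  dict mapping each LR(0) reduce state -> the rule of its single final item.
    """
    table = {}
    for (p, x), q in transitions.items():
        if q in lr0_reduce:
            # GOTO_0(p, X) = q: read X in p and reduce at once; q is never entered.
            table[(p, x)] = ("read-reduce", lr0_reduce[q])
        else:
            table[(p, x)] = ("read", q)
    return table


# Toy automaton for S -> a B, B -> b; states 2 and 3 each hold a single final item.
transitions = {(0, "a"): 1, (1, "b"): 2, (1, "B"): 3}
lr0_reduce = {2: "B -> b", 3: "S -> a B"}
table = remove_lr0_reduce_states(transitions, lr0_reduce)
```

In this toy table the action on the terminal b in state 1 plays the role of a shift-reduce action and the action on the nonterminal B that of a goto-reduce action; states 2 and 3 no longer appear as targets.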

As there is only one kind of action that can be executed in an LR(0) reduce state,
the removal of these states from an LR automaton does not cause premature reductions.
Moreover, observe that when the parser computes a read-reduce action, that action is
followed by a sequence of zero or more goto-reduce actions, and finally, by a goto action.
All these actions may also be executed without deferral, by the same argument.

When the parser executes a reduce action in a non-LR(0) reduce state, that action
is  also  followed  by  goto-reduce  actions  and  a  final  goto  action.  If  the  reduce  action  in 
question  is  an  illegal  action,  executed  by  default,  then  all  the  associated  goto-reduce  and 
goto  actions  following  it  are  also  illegal  moves.  To  complicate  matters,  the  goto  action  may 
be  followed  by  a  sequence  of  reduce  actions  on  empty  rules,  each  followed  by  its  associated 
goto-reduces  and  a  final  goto  action.  In  such  a  case,  all  actions  induced  by  the  lookahead 
symbol  must  be  invalidated  and  the  original  configuration  of  the  parser  (prior  to  the  initial 
reduction)  must  be  restored. 

One way to achieve this goal is as follows. When a reduce action is encountered, copy
the state stack into a temporary stack and simulate the parser using the temporary stack
until either a shift, shift-reduce or error action is computed on the lookahead symbol. If
the first non-reduce action computed on the lookahead is valid, the temporary stack is
copied into the state stack and the parsing can continue. Otherwise, the error recovery
routine is invoked with the unadulterated state stack. This idea captures the essence of
what needs to be done, but it is too costly for practical use.

Instead of copying the information, the temporary stack is used to hold the values of
the contiguous elements of the state stack that have been added or rewritten. If the moves
turn out to be valid, then only the added or rewritten elements are copied to the state
stack. Otherwise, the original configuration is passed to the error recovery routine. This
idea is illustrated in the lookahead_action function of Figure 4.3.

Starting with a given configuration, the lookahead_action function, in essence, checks
whether or not it is possible to advance the parse past the current token, and if so, it
returns the first non-reduce action induced by the current token. As a side effect, it also
computes the index position (pos) of the topmost element in the state stack sequence that
is still useful after all the actions induced by the current token are executed. All new
-- Assume RHS and LHS are maps that yield the size of the right-hand side and left-hand side
-- symbol of a given rule, respectively. ACTION and GOTO are the terminal and nonterminal
-- parsing functions, respectively.

 1. function lookahead_action(stack, tok, pos);
 2.    temp_stack := [ ];
 3.    pos := #stack;
 4.    top := pos - 1;
 5.    act := ACTION(stack(pos), tok);
 6.    while act is a reduce action loop
 7.       until act is not a goto-reduce action loop
 8.          top := top - RHS(act) + 1;
 9.          if top > pos then
10.             s := temp_stack(top);
11.          else s := stack(top);
12.          end if;
13.          act := GOTO(s, LHS(act));
14.       end loop;
15.       temp_stack(top+1) := act;
16.       act := ACTION(act, tok);
17.       pos := min(pos, top);
18.    end loop;
19.    return act;
20. end lookahead_action;


Figure 4.3: lookahead_action function




act := q0;
state_stack := [ ];
location_stack := [ ];
while act is not the accept action loop
   state_stack := state_stack + [act];
   location_stack(#state_stack) := curtok.location;
   act := lookahead_action(state_stack, t1, pos);   -- recall that t1 = curtok.class
   if act ≠ error action then
      state_stack(pos+1..) := temp_stack(pos+1..);
      location_stack(pos+1..) := [curtok.location : i in [pos+1..#state_stack]];
      if act is a shift-reduce action then
         top := #state_stack;
         until act is not a goto-reduce action loop
            top := top - RHS(act) + 1;
            act := GOTO(state_stack(top), LHS(act));
         end loop;
         state_stack(top+1..) := [ ];
      end if;
      get next token;
   else
      error_recovery();
   end if;
end loop;


Figure 4.4: Driver with one deferred token


stack  information  above  that  index  position  is  stored  in  the  temporary  stack  in  its  relative 
position. 

Lemma 4.2.1 Let stack_top = top+1 prior to entering the inner loop of lookahead_action.
When the inner loop is exited, top ≤ stack_top.

Proof: The first iteration of the loop processes an initial reduce action. At the end of the
first iteration, the variable top is either increased by 1 (in the case of an empty production,
RHS(act) = 0), left unchanged (in the case of a unit production), or decreased. Each
subsequent iteration of the loop is induced by a goto-reduce action processing a non-empty
rule, which either leaves the value of top unchanged or decreases it.

When  the  inner  loop  is  exited,  the  transition  state  associated  with  the  goto  action 
that  caused  the  exit  is  stored  in  the  temporary  stack  at  position  top+1,  the  new  upper 
bound  of  the  stack.  If  top  is  a  new  upper  bound  for  the  initial  state  stack,  pos  is  updated 
accordingly.  When  the  outer  loop  is  exited,  the  action  that  caused  the  exit  is  returned. 

Assume that the starting state of an LALR(k) parser is denoted by $q_0$ and that a temporary
stack denoted temp_stack is globally available. The algorithm of Figure 4.4 depicts the
body of a driver with actions deferred on one token symbol. Initially, the parser is in
configuration $q_0 \mid w$, where $q_0$ is the start state and w is the whole input string.
Starting with the initial configuration, the idea is to advance through the input stream one
token at a time,
state_stack := [q0];
while true loop
   ppos := 0;   previous_stack := [ ];
   npos := 0;   next_stack := [ ];
   location_stack(#state_stack) := curtok.location;
   temp_stack := state_stack;
   act := lookahead_action(temp_stack, t1, pos);
   while act ≠ error and act ≠ accept loop
      next_stack(npos+1..) := temp_stack(npos+1..);
      location_stack(pos+1..) :=
         [curtok.location : i in [pos+1..#next_stack]];
      if act is a shift-reduce action then
         top := #next_stack;
         until act is not a goto-reduce action loop
            top := top - RHS(act) + 1;
            act := GOTO(next_stack(top), LHS(act));
         end loop;
         next_stack(top+1..) := [act];
         pos := min(pos, top);
      end if;
      act := lookahead_action(next_stack, t2, npos);
      if act ≠ error then
         get next token;
         previous_stack(ppos+1..) := state_stack(ppos+1..);
         ppos := pos;
         state_stack(pos+1..) := next_stack(pos+1..);
         pos := npos;
      end if;
   end loop;
   if act = accept then
      return;
   end if;
   error_recovery();
end loop;


Figure 4.5: Driver with 3 deferred tokens
executing all actions induced by the current input token at each step. Thus, the function
lookahead_action is invoked at each step to check whether or not it is possible to advance.
If so, the state stack is updated by replacing all of its topmost elements from pos+1 to
#temp_stack by the corresponding temp_stack elements. Usually, pos+1 = #temp_stack
unless a non-empty sequence of empty reductions, each followed immediately by a goto action,
was executed prior to exiting the outer loop of the lookahead_action function. In such
a case, #temp_stack exceeds pos+1 by the number of such empty reductions that were
executed. If the valid action returned by the lookahead_action function is a shift-reduce
action, then it and all its associated goto-reduce actions are processed. When the parser
can advance successfully, the next token is read in and the process is repeated on the new
configuration. If, on the other hand, the error action was returned by the lookahead_action
function, the state stack is not updated and the error recovery routine is invoked instead.

It is not hard to see how this driver routine can be adapted to defer parsing actions on
n tokens given n state stacks. In experiments with this method, parsing has been deferred
for three tokens. The three stacks that are used are: previous_stack, which captures the
configuration of the parser prior to processing any action induced by prevtok; state_stack,
which captures the configuration prior to processing actions induced by curtok; and next_stack,
which captures the configuration prior to processing actions induced by the successor of
curtok. Associated with these stacks are three integer variables, ppos, pos and
npos, which are used to mark the position of the top element in the corresponding stack
that is still valid after the actions induced by the relevant lookahead symbol are applied.
Figure 4.5 shows the body of a driver routine with actions deferred on three input symbols.
This deferral usually increases the time requirement of the parser by 25% or less.
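To make the interplay between the lookahead_action function and the one-token driver concrete, the following Python sketch runs both on a toy grammar. Everything about the encoding is an assumption made for illustration: the grammar, the tuple representation of actions, the `DEFAULT` table of default reductions, and the extra `height` value returned by `lookahead_action` (used here instead of the length of the temporary stack) are not taken from this work.

```python
# Toy grammar (assumed): 1: E -> E + T   2: E -> T   3: T -> T * id   4: T -> id
RHS = {1: 3, 2: 1, 3: 3, 4: 1}           # size of each right-hand side
LHS = {1: "E", 2: "E", 3: "T", 4: "T"}   # left-hand side symbol of each rule
ACTION = {
    (0, "id"): ("shift-reduce", 4),
    (1, "+"): ("shift", 3), (1, "$"): ("accept",),
    (2, "*"): ("shift", 4),
    (3, "id"): ("shift-reduce", 4),
    (4, "id"): ("shift-reduce", 3),
    (5, "*"): ("shift", 4),
}
DEFAULT = {2: ("reduce", 2), 5: ("reduce", 1)}   # default reductions
GOTO = {(0, "E"): ("goto", 1), (0, "T"): ("goto", 2), (3, "T"): ("goto", 5)}


def action(state, tok):
    return ACTION.get((state, tok), DEFAULT.get(state, ("error",)))


def lookahead_action(stack, tok):
    """Simulate the reductions induced by tok without modifying stack.

    Returns (act, pos, temp, height): the first non-reduce action, the number of
    bottom elements of stack that remain valid, the rewritten states keyed by
    1-based index, and the stack height after the simulated moves.
    """
    temp, pos = {}, len(stack)
    top = pos - 1
    act = action(stack[pos - 1], tok)
    while act[0] == "reduce":
        rule = act[1]
        while True:                       # the reduce, then any goto-reduce actions
            top = top - RHS[rule] + 1
            state = temp[top] if top > pos else stack[top - 1]
            go = GOTO[(state, LHS[rule])]
            if go[0] != "goto-reduce":
                break
            rule = go[1]
        temp[top + 1] = go[1]
        act = action(go[1], tok)
        pos = min(pos, top)
    return act, pos, temp, top + 1


def parse(tokens):
    """One-token deferred driver; tokens must end with "$".

    Returns (accepted, state stack, index of the current token).
    """
    stack, i = [0], 0
    while True:
        act, pos, temp, height = lookahead_action(stack, tokens[i])
        if act[0] == "error":
            return False, stack, i        # the failed moves left stack untouched
        stack[pos:] = [temp[k] for k in range(pos + 1, height + 1)]
        if act[0] == "accept":
            return True, stack, i
        if act[0] == "shift":
            stack.append(act[1])
        else:                             # shift-reduce and its goto-reduce actions
            rule, top = act[1], len(stack)
            while True:
                top = top - RHS[rule] + 1
                go = GOTO[(stack[top - 1], LHS[rule])]
                if go[0] != "goto-reduce":
                    break
                rule = go[1]
            stack[top:] = [go[1]]
        i += 1
```

On the erroneous input `["id", "id", "$"]`, the default reduction by E -> T is simulated on the second `id` but never committed, so the driver reports the error with state 2 still on top of the stack.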

4.3      Recovery  Strategies 

Each recovery attempt is called a trial. The effectiveness of a recovery is evaluated using
a validation function, parse_check, which indicates how many tokens in the input buffer
can be successfully parsed after the repair in question is applied: the parse_check distance. A
recovery trial is not considered successful unless the parse_check distance is greater than or
equal to a certain value, called min_distance. Experiments have shown that a good choice
for min_distance is 2 [37].

The parse_check function is essentially an LR driver that simulates the parse until it
has either shifted all the tokens in the buffer, completed the parse successfully, or reached
a token in error. The same approach taken in implementing the lookahead_action function
can be extended to implement the parse_check function; i.e., a temporary stack can be
used to keep track of all state information related to transitions induced by the lookahead
tokens in the buffer, thus avoiding copying the state stack or destroying it.

In the remainder of this section, the implementation of the three recovery strategies
mentioned earlier is described in detail.

4.3.1      Simple  Recovery 

Given a configuration $q_1, q_2, \ldots, q_m \mid t_1, t_2, \ldots, t_n$, where $t_1$ is assumed to be
the error token, the simple recovery finds the best possible simple repair (if any) for that
configuration. The selection of a best simple repair is based on three criteria:

•  the parse_check distance

•  the misspelling index

•  the order in which the trials are performed.

The misspelling index is a real value between 0.0 and 1.0 that is associated with each
simple recovery trial. When a new token is substituted for the error token (a simple
substitution), a misspelling function is invoked to determine the misspelling index; i.e., the
relative proximity of the two tokens in question expressed as a probabilistic value. For
other kinds of recoveries, the misspelling index is set to a constant value depending on the
recovery in question and other conditions. This will be discussed later.

Simple recoveries are attempted in the following order: merging of the error token ($t_1$)
with its successor ($t_2$); deletion of $t_1$; insertion of each terminal candidate in
t_symbols($q_m$) before $t_1$; substitution of each legal terminal candidate in t_symbols($q_m$)
for $t_1$; insertion of each nonterminal candidate in nt_symbols($q_m$) before $t_1$; and, finally,
substitution of each nonterminal candidate in nt_symbols($q_m$) for $t_1$. For now, one can
assume that for a state q, t_symbols(q) and nt_symbols(q) yield the sets of all terminal and
nonterminal symbols, respectively, on which actions are defined in q. Optimization of these
sets is discussed in section 5.2.1.
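Under the simplifying assumption just stated, the two candidate maps can be read directly off the parsing tables. The Python sketch below illustrates this; the dictionary encoding of the tables and the name `candidate_maps` are hypothetical, and the tables are assumed to be the uncompressed ones (before default reductions hide the lookahead sets).

```python
from collections import defaultdict


def candidate_maps(action_table, goto_table):
    """Collect, per state, the symbols on which some action is defined.

    Both tables are dicts keyed by (state, symbol); this encoding is an assumption.
    """
    t_symbols, nt_symbols = defaultdict(set), defaultdict(set)
    for state, terminal in action_table:
        t_symbols[state].add(terminal)
    for state, nonterminal in goto_table:
        nt_symbols[state].add(nonterminal)
    return t_symbols, nt_symbols


action_table = {(0, "id"): "shift", (0, "("): "shift",
                (1, "+"): "shift", (1, "$"): "accept"}
goto_table = {(0, "E"): 1, (0, "T"): 2}
t_symbols, nt_symbols = candidate_maps(action_table, goto_table)
```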

As the trials are performed, the simple recovery routine keeps track of the most successful
trial. Initially, the merge recovery is chosen since it is attempted first. If a subsequent
recovery yields a larger parse_check distance than the previously chosen recovery, or it
yields the same parse_check distance but with a greater misspelling index, then it is chosen
instead as the best recovery candidate.
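The selection rule can be sketched as follows in Python. The `Trial` record and the function name are assumptions made for this illustration; the trials are assumed to arrive in the order listed above, so that an earlier trial wins any exact tie.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    kind: str                 # e.g. "merge", "deletion", "insertion", "substitution"
    distance: int             # parse_check distance reached after the repair
    misspelling_index: float  # value between 0.0 and 1.0


def best_simple_repair(trials, min_distance=2):
    """Return the best trial, or None if no trial reaches min_distance."""
    best = None
    for trial in trials:      # trials are visited in the prescribed order
        if best is None:
            best = trial
        elif trial.distance > best.distance or (
            trial.distance == best.distance
            and trial.misspelling_index > best.misspelling_index
        ):
            best = trial
    if best is not None and best.distance >= min_distance:
        return best
    return None
```

For instance, a merge trial with distance 3 and index 1.0 is kept over a later deletion with the same distance and index 0.0, whereas a later insertion with distance 4 would replace it.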

For the merge trial, the character string representation of $t_2$ is concatenated to the
character string representation of $t_1$ to obtain a merged string s. A test is then performed
to determine if s is the character string representation of some $t \in$ t_symbols($q_m$). If
such an element t, called a merge candidate, is found, a new configuration is obtained by
temporarily replacing $t_1$ and $t_2$ with t in the input sequence and the parse_check distance
is computed for this new configuration.

As described in the previous section, the deferred driver ensures that the state $q_m$ on
top of the stack of the error configuration is the state entered prior to the execution of any
action on $t_1$. In that configuration, it may be possible to execute a sequence of reduce,
goto-reduce and goto actions before the illegality of $t_1$ is detected in another state $q_e$. In
such a case, the elements in t_symbols($q_m$) that are also in t_symbols($q_e$) are given priority
in applying the insertion and substitution trials. (It is not hard to show that
t_symbols($q_e$) $\subseteq$ t_symbols($q_m$).) This ordering is not crucial but its benefits can be
seen in the following example:

write(1*5+6;2*3,4/2)

In this erroneous Pascal statement, a semicolon is used instead of a comma after the first
parameter. Assume state $q_m$ is the first state that encounters the semicolon. At that point,
the parser has just shifted an expression operand and the set of valid lookahead symbols
includes not only the comma but all the arithmetic operators. However, if the parser is
allowed to interpret the operand as a complete expression, it will enter an error state $q_e$
where the comma is the only candidate.

In order to give priority to the candidates in an error state $q_e$, it is necessary to identify
when the parser has entered such a state. State $q_e$ can be computed in the lookahead_action
function by inserting the following statement after lines 3 and 14 in Figure 4.3:

error_state := act;

4.3.1.1  Misspelling  Index 

For a successful merge trial, the misspelling index is set to 1.0 since the merged string must
perfectly match the character string representation of the merge candidate. This ensures
that if a merge trial yields a successful recovery, any subsequent recovery that is applicable
in the given context will not be chosen over the merge recovery unless it yields a longer
parse_check distance. Consider the following example:

if   X  >     =  y  then 
X    :      =  z; 

In this case, it is most likely that the programmer inadvertently inserted a space character
between ">" and "=" in the first line, and between ":" and "=" in the second line. Both
errors can be successfully repaired with merging. However, from a syntactic point of view,
the first error is just as easily repaired by deleting the ">" symbol (or the "="); but, since
the merging trial is performed first and, in addition, it yields a higher misspelling index, it
is chosen over the deletion.

As mentioned earlier, a misspelling function is invoked to calculate the misspelling
index for a simple substitution. For all other recoveries, the misspelling index is set to 0.0
unless the candidate in question was identified as a special end-of-line terminal and the
correction is to be made at the end of a line. In that case, the misspelling index is set to
1.0. (A boolean attribute is associated with each input token indicating whether or not
it is located at the end of a line in the input.) In Pascal, the end-of-line terminal is ";".
Consider the following example:

X    :=  y 

p(x); 

The error token in this incorrect fragment is p. Syntactically, any arithmetic operator,
say "*", inserted after the symbol y would repair this error. However, it is clear from
the context that since y is at the end of the line, ";" is a better candidate. Setting the
misspelling index to 1.0 when inserting the end-of-line terminal at the end of a line gives
it precedence over other insertion candidates.
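The rules for assigning the misspelling index can be summarized in a small Python sketch. The function signature, the string names for the recovery kinds, and the placeholder `misspell` argument are assumptions made for this illustration; `"substitution"` stands for a simple (terminal) substitution only.

```python
def misspelling_index(kind, candidate, error_text="", at_end_of_line=False,
                      eol_terminal=";", misspell=lambda a, b: 0.0):
    """Assign a misspelling index to a simple recovery trial.

    kind:           "merge", "deletion", "insertion" or "substitution" (assumed names).
    candidate:      text of the symbol being inserted, substituted or merged.
    at_end_of_line: True if the correction is made at the end of a line.
    misspell:       misspelling function; the default here is only a stand-in.
    """
    if kind == "merge":
        return 1.0                      # the merged string matches exactly
    if kind == "substitution":
        return misspell(candidate, error_text)
    if candidate == eol_terminal and at_end_of_line:
        return 1.0                      # end-of-line terminal at the end of a line
    return 0.0
```

With these rules, inserting ";" after a token flagged as the last on its line gets index 1.0, while inserting "*" at the same place gets 0.0.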

4.3.1.2  Misspelling  Function 

In designing a misspelling function for syntactic error recovery, special attention should
be paid to the kinds of errors that a programmer is likely to make. For example, the
use of abbreviations, such as int instead of integer or proc instead of procedure, is
common. Often, words that start with a similar prefix are substituted for each other; e.g.,
procedure for program. The algorithm used should be sensitive enough to identify such
a pair of words as a misspelling error (though with a low probability); but it should also
be able to reject a pair of words such as return and turner that are permutations of the
same set of characters but unlikely to be misspellings of each other.

The misspelling algorithm used in this method is an adaptation of an algorithm proposed
by Juergen Uhl [43]. In that approach, three kinds of misspelling errors are identified:
transposition of two adjacent characters, mismatch of corresponding characters and omission
(insertion) of a character. The idea is to scan the two strings simultaneously and
compute the following information as they are being traversed:

•  the length of the initial prefix of the two strings that matches (prefix_length)

•  the total number of characters that match (match_count)

•  the number of errors found (error_count)

With this information, the probability that a string s1 is a misspelling of a string s2 is
calculated as follows. Let the error threshold be the length of the shortest string divided
by six plus one; i.e., at least one error is allowed plus an additional error for each six
characters in the shortest string. If the number of errors detected by the pattern match
(error_count) is less than or equal to the error threshold, then match_count represents the
result of the pattern match (pattern_count); otherwise, only the number of initial characters
that matched (prefix_length) is considered. The final result is obtained by dividing
pattern_count by the length of the longest string plus error_count. Figure 4.6 shows a complete
implementation of this algorithm.
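As a worked illustration, the computation just described can be transcribed into Python. This sketch follows the prose above (match_count is used when the error threshold is respected, prefix_length otherwise). The bounds-checked helper `ch`, the integer division in the threshold, and the early return for identical strings are assumptions added to make the code runnable.

```python
def misspell(s1: str, s2: str) -> float:
    """Estimate how likely it is that s1 is a misspelling of s2 (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0                       # also avoids dividing by zero on empty input

    def ch(s, k):
        # Character at index k, or None past the end (None never equals a character).
        return s[k] if k < len(s) else None

    i = j = match_count = prefix_length = error_count = 0
    while i < len(s1) and j < len(s2):
        if s1[i] == s2[j]:                                       # matched characters
            match_count += 1
            if error_count == 0:
                prefix_length += 1
            i, j = i + 1, j + 1
        elif ch(s1, i + 1) == s2[j] and s1[i] == ch(s2, j + 1):  # transposition
            match_count += 2
            error_count += 1
            i, j = i + 2, j + 2
        elif ch(s1, i + 1) == ch(s2, j + 1):                     # mismatch
            error_count += 1
            i, j = i + 1, j + 1
        else:                                                    # omission or insertion
            if len(s1) - i > len(s2) - j:
                i += 1
            elif len(s2) - j > len(s1) - i:
                j += 1
            else:
                i, j = i + 1, j + 1
            error_count += 1
    if i < len(s1) or j < len(s2):       # one string has an unmatched suffix
        error_count += 1

    threshold = min(len(s1), len(s2)) // 6 + 1
    pattern_count = match_count if error_count <= threshold else prefix_length
    return pattern_count / (max(len(s1), len(s2)) + error_count)
```

Under these assumptions, `misspell("int", "integer")` evaluates to 3/8 = 0.375, `misspell("procedure", "program")` to 3/15 = 0.2, and `misspell("return", "turner")` to 0.0, which agrees with the behaviour asked for in the preceding paragraphs.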

Certain static characters are also likely to be substituted for others. For example.


In  this  method,  the  characters  of  each  of  the  above  pairs  are  considered  to  be  0..3.3'a  likely 
to  be  a  misspelling  of  each  other. 

4.3.2     Phrase-Level  Recovery 

Phrase-level recovery is based on the identification of an error phrase which is then deleted
from the input or replaced by a suitable nonterminal symbol or reduction goal. If the
string

$$q_1 \ldots q_m \mid t_1 \ldots t_n \qquad (4.1)$$

is an error configuration, then a substring

$$q_{i+1} \ldots q_m \mid t_1 \ldots t_{j-1} \qquad (4.2)$$
-- The function misspell takes two arguments: s1 and
-- s2 which are character strings of arbitrary length.

function misspell(s1, s2);
   i := 0;
   j := 0;
   match_count := 0;
   prefix_length := 0;
   error_count := 0;
   while ((i < #s1) and (j < #s2)) loop
      if s1(i) = s2(j) then                               -- matched characters?
         match_count := match_count + 1;
         i := i + 1;
         j := j + 1;
         if error_count = 0 then                          -- prefix character?
            prefix_length := prefix_length + 1;
         end if;
      elseif s1(i+1) = s2(j) and s1(i) = s2(j+1) then     -- transposition?
         match_count := match_count + 2;
         i := i + 2;
         j := j + 2;
         error_count := error_count + 1;
      elseif s1(i+1) = s2(j+1) then                       -- mismatch?
         i := i + 1;
         j := j + 1;
         error_count := error_count + 1;
      else                                                -- definitely a deletion
         if (#s1-i) > (#s2-j) then                        -- suffix of s1 longer?
            i := i + 1;
         elseif (#s2-j) > (#s1-i) then                    -- suffix of s2 longer?
            j := j + 1;
         else
            i := i + 1;
            j := j + 1;
         end if;
         error_count := error_count + 1;
      end if;
   end loop;
   if (i < #s1) or (j < #s2) then
      error_count := error_count + 1;
   end if;
   if error_count <= (min(#s1, #s2) / 6 + 1) then         -- check error threshold
      pattern_count := match_count;
   else pattern_count := prefix_length;
   end if;
   return float(pattern_count) / (max(#s1, #s2) + error_count);
end misspell;


Figure 4.6: Misspelling function
$1 \le i \le m$, $1 \le j \le n$, of that configuration is an error phrase (of the configuration) if
removing that substring allows the parser to advance at least min_distance tokens into
the forward context, or if there is a nonterminal A such that a valid action is defined in
state $q_i$ on A, and after processing A, the parser can advance at least min_distance tokens
into the forward context. Here, $q_i$, A and $t_j$ are the recovery state, reduction goal and
recovery symbol, respectively.

In  [27],  the  authors  present  a  detailed  discussion  of  different  strategies  that  are  used 
for  selecting  error  phrases,  and  the  advantages  and  disadvantages  of  each  strategy.  In 
general,  the  search  for  the  error  phrase  begins  with  the  shortest  possible  error  phrase;  that 
is,  with  the  one  consisting  only  of  the  vertical  bar,  and  proceeds  with  larger  and  larger 
segments  of  the  configuration.  A  search  order  may  be  used  that  consumes  state  stacks 
faster  than  input  symbols  or  consumes  input  symbols  faster  than  state  stacks  or  a  more 
balanced  scheme  may  be  used. 

The scheme used in this method to select error phrases reflects a fundamental distinction
that is made among three different kinds of errors. Consider the error configuration
(4.2) above. The case of the empty error phrase is considered during simple recovery as a
nonterminal insertion. Similarly, the case of an error phrase $\varepsilon \mid t_1$ being deleted or
replaced by a nonterminal candidate is processed by a simple deletion or substitution. Next,
priority is given to a successful phrase-level recovery that consumes no input symbol and
requires no insertion of a reduction goal; i.e., a recovery based on the removal of an error
phrase of the form $\beta \mid \varepsilon$ where $\beta \neq \varepsilon$. This kind of error is called a
misplacement error, and $\beta$ is called a misplaced phrase. The following Pascal example
illustrates this case:

1.  program P(INPUT, OUTPUT);
2.  var I: real;
    <   >
***Error: Misplaced construct(s)
3.  type ORDER=array[1..MAX] of real;
4.  var Q: integer;
5.  begin
6.  end.

Finally,  the  case  in  which  one  or  more  input  symbols  and/or  states  must  be  deleted 
or  replaced  with  a  nonterminal  candidate  is  considered.  In  that  case,  input  symbols  are 
consumed  faster  than  states.  In  other  words,  the  error  phrases  are  selected  as  indicated 
by  the  row-major  order  of  the  table  below: 


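$$
\begin{array}{ccc}
\varepsilon \mid \varepsilon & \cdots & \varepsilon \mid t_1 \ldots t_n \\
q_m \mid \varepsilon & \cdots & q_m \mid t_1 \ldots t_n \\
\vdots & & \vdots \\
q_2 \ldots q_m \mid \varepsilon & \cdots & q_2 \ldots q_m \mid t_1 \ldots t_n
\end{array}
$$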
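The row-major search order can be sketched as a Python generator. The function name and the list encoding of the configuration are assumptions made for this illustration; the first entries it yields correspond to cases that, as noted above, are already covered by simple recovery and by the misplacement trials.

```python
def error_phrases(stack, tokens):
    """Yield (states removed from the top of the stack, input symbols removed).

    stack is [q_1, ..., q_m] and tokens is [t_1, ..., t_n]; q_1 is never removed.
    Input symbols are consumed faster than states: the inner loop runs over tokens.
    """
    m, n = len(stack), len(tokens)
    for popped in range(m):               # rows: no state, q_m, ..., q_2 .. q_m
        for consumed in range(n + 1):     # columns: no token, t_1, ..., t_1 .. t_n
            yield stack[m - popped:], tokens[:consumed]


phrases = list(error_phrases(["q1", "q2", "q3"], ["t1", "t2"]))
```

For this three-state, two-token configuration the generator produces nine candidate phrases, starting with the empty phrase and ending with `(["q2", "q3"], ["t1", "t2"])`.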

In this final case, each error phrase selected is removed from the base configuration (4.1).
An initial attempt is made to recover by parse checking the resulting configuration. This
action, called phrase deletion, can be viewed as a multiple deletion of the symbols that
make up the error phrase. Next, each element in the set of nonterminal candidates for the
newly exposed state on top of the state stack is substituted, in turn, for the error phrase
and the parse_check function is invoked to determine its viability. This action is called a
if_stmt     →  if cond then
                  st_list elsif_list opt_else
               end if ;
st_list     →  stmt | st_list stmt
elsif_list  →  ε | elsif_list elsif cond then st_list
opt_else    →  ε | else st_list
stmt        →  ... | if_stmt | ...


Figure 4.7: BNF rule for Ada if statement


phrase  substitution.  This  process  continues  until  a  successful  recovery  is  found  or  all  the 
possibilities  are  exhausted. 

In phrase-level recovery, the aim is to find a repair that least alters the original
configuration. For this reason, misplacement trials are performed separately from the other
phrase-level trials and given higher priority, since such a repair does not delete any symbol
from the forward context and tends to remove whole structures from the left context that
have been previously analysed. The parse_check distance is used as the criterion to select
the best misplacement repair. After the misplacement trials, a phrase deletion and
substitution trial is performed on successive error phrases. The selection of a best deletion or
substitution repair is based on the length of the relevant error phrase and the parse_check
distance, with deletion having priority over substitution in case of a tie. The length of an
error phrase $\beta \mid x$ is obtained by adding the length of the string x to the number of non-null
symbols in $\beta$.

Given the best misplacement repair and the best deletion or substitution repair, if the
misplacement repair is based on a shorter error phrase or it yields a longer parse_check
distance, then it is chosen. Otherwise, the deletion or substitution is chosen.
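The final choice described above can be sketched as follows. This is only an illustration in Python: the `Repair` record, the `(symbol, is_null)` encoding of $\beta$ and the handling of a missing candidate are assumptions introduced here, not part of the method's definition.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Repair:
    kind: str           # "misplacement", "deletion" or "substitution"
    phrase_length: int  # length of the error phrase the repair is based on
    distance: int       # parse_check distance reached after the repair


def error_phrase_length(beta: List[Tuple[str, bool]], x: List[str]) -> int:
    """Length of an error phrase: |x| plus the number of non-null symbols in beta.

    Assumption: each element of beta is (symbol, is_null), where is_null
    marks a symbol that covers no input.
    """
    return len(x) + sum(1 for _symbol, is_null in beta if not is_null)


def choose_repair(misplacement: Optional[Repair],
                  other: Optional[Repair]) -> Optional[Repair]:
    """Pick between the best misplacement and the best deletion/substitution."""
    if misplacement is None or other is None:
        # Assumption: when only one candidate exists, it is used as is.
        return misplacement if other is None else other
    # Misplacement wins on a shorter error phrase or a longer parse_check distance.
    if (misplacement.phrase_length < other.phrase_length
            or misplacement.distance > other.distance):
        return misplacement
    return other
```

Note that the rule is asymmetric: a misplacement repair needs to be better on only one of the two criteria, which reflects the higher priority given to misplacement trials.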

4.3.3     Scope  Recovery 

One of the most common errors committed by programmers is the omission of block
closers such as an end statement or a right parenthesis. Such an error is referred to as a
scope error. Scope errors are common because the structures requiring block closers are
usually recursive structures that, in practice, are specified in a nested fashion. In such a
case, a matching block closer must accompany each structure in the nest. For example,
if a user specifies an expression that is missing a single right parenthesis, simple recovery
can successfully insert that symbol. However, if two or more right parentheses are missing,
neither simple nor phrase-level recovery can successfully repair such an error. Similarly,
consider the BNF rule for an Ada if-statement in Figure 4.7 [28]. If an Ada if-statement
is specified without the "end if;" closer, neither of the two recovery techniques mentioned
so far can effectively repair this error. The repair that is necessary for this kind of error is
the insertion of a sequence of symbols, called multiple symbol insertion.

Scope recovery was first introduced by Burke and Fisher [37]. Their technique requires
that each closing sequence be supplied by the user as a list of terminal symbols. Scope
recovery is attempted by checking whether or not the insertion of a combination of these
closing sequences can allow the parser to recover.

By contrast, the scope recovery technique used in this method is based on the
identification of one or more recursively defined rules that are incompletely specified, and insertion
of the appropriate closing symbols to complete these phrases. All necessary scope
information required by this method is precomputed automatically from the input grammar. In
addition, because the method is based on a pattern match with complete rules rather than
just the insertion of closing sequences of terminal symbols, the diagnosis of scope errors is
more accurate in that it identifies whole structures that are incompletely specified instead
of just the missing sequence of closing terminals.

4.3.3.1      Scope  Information 

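Definition 4.3.1 A rule $A \to \alpha B \beta$ is a scoped rule if and only if the following conditions
are satisfied:

1. $B \Rightarrow^{*} \gamma A \delta$, for some arbitrary strings $\gamma$ and $\delta$

2. $\beta \not\Rightarrow^{*} \varepsilon$

3. $\alpha \neq \varepsilon$ or $B \not\Rightarrow^{*}_{rm} A\psi$, for any string $\psi$

4. $\alpha \neq \varepsilon$ or $\exists$ a viable prefix $\phi A$ such that either $\phi B$ is not a viable prefix or
$\nexists\, C \in N$ such that $\phi C \Rightarrow^{*}_{rm} \phi A$ and $\phi C \Rightarrow^{+}_{rm} \phi B$.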

Condition 1 simply states that for a rule to be a scoped rule it must be recursive.
Condition 2 guarantees that the closing sequence following the recursive symbol in the
right-hand side of the rule is not nullable. These two conditions are consistent with the
intuitive notion of a scope presented earlier. By contrast, the purpose of conditions 3 and
4 is to exclude some unnecessary cases. For example, consider the following left-recursive
rule:

List → List , atom

Condition 3 eliminates such a rule from consideration since the initial prefix preceding the
recursive nonterminal List in the right-hand side is empty and List $\Rightarrow^{*}_{rm}$ List. The rules
affected by condition 4 are more complex. Consider the following grammar:


List     →  Sublist
List     →  Sublist List
Sublist  →  atom
Sublist  →  ( List )

Since Sublist $\Rightarrow$ ( List ), the right-recursive rule List → Sublist List satisfies conditions 1,
2 and 3. However, it does not satisfy condition 4, because List $\Rightarrow^{+}_{rm}$ Sublist. In essence,
condition 4 eliminates from consideration all rules of the form $A \to B\beta$ such that whenever
$A$ is introduced through closure, it is introduced by some nonterminal $C$ (where $C$ may
be the symbol $A$ itself, as in the above example) which can produce both $A$ and $B$ by a
sequence of right-most derivations.

In the above example, the rule Sublist → ( List ) is a scoped rule. In the example
of Figure 5.5, the rule P → (E) is a scoped rule since P can be derived from E. The
if_stmt rule of Figure 4.7 is also a scoped rule since each of the symbols st_list, elsif_list
and opt_else in that rule can recursively derive a string containing the symbol if_stmt. A
scope can be derived from a scoped rule for each recursive symbol in the right-hand side
of the scoped rule.
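Conditions 1 and 2 of Definition 4.3.1 are easy to check mechanically. The sketch below, in Python, is an illustration only: the grammar encoding (a list of `(lhs, rhs)` pairs) and the function names are assumptions made here, and conditions 3 and 4, which need viable-prefix information from the LR automaton, are deliberately left out.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, List[str]]


def nullable_symbols(rules: List[Rule]) -> Set[str]:
    """Nonterminals that can derive the empty string (fixed-point iteration)."""
    nullable: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nullable and all(sym in nullable for sym in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def reachable_symbols(rules: List[Rule], start: str) -> Set[str]:
    """Symbols occurring in some string derivable from start (start included)."""
    seen = {start}
    work = [start]
    while work:
        current = work.pop()
        for lhs, rhs in rules:
            if lhs == current:
                for sym in rhs:
                    if sym not in seen:
                        seen.add(sym)
                        work.append(sym)
    return seen


def satisfies_conditions_1_and_2(rules: List[Rule], lhs: str,
                                 rhs: List[str], b_index: int) -> bool:
    """Check conditions 1 and 2 for the rule lhs -> rhs with B = rhs[b_index]."""
    nullable = nullable_symbols(rules)
    # Condition 1: B derives a string containing the left-hand side.
    recursive = lhs in reachable_symbols(rules, rhs[b_index])
    # Condition 2: the closing sequence after B is not nullable.
    closer_not_nullable = any(sym not in nullable for sym in rhs[b_index + 1:])
    return recursive and closer_not_nullable
```

Applied to the List grammar above, both Sublist → ( List ) (with B = List) and List → Sublist List (with B = Sublist) pass these two conditions. Only the first is a scoped rule, because the second is later rejected by condition 4.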

A scope is a quintuple $(\pi, \sigma, a, A, Q)$ where $\pi$ and $\sigma$ are strings of symbols called scope
prefix and scope suffix, respectively, $a$ is a terminal symbol called the scope lookahead, $A$ is
a nonterminal symbol called the left-hand side and $Q$ is a set of states. The scope prefix
is the prefix of a suitable string derivable from the scoped rule in question. It is used to
determine whether or not a recovery by the associated scope is applicable; i.e., at run time,
a repair by a given scope is considered only if this initial substring of the suitable string
can be successfully derived before the error token causes an error action. The scope suffix
is the suffix (of the suitable string) that follows the scope prefix. When diagnosing a scope
error, the user is advised to insert the symbols of the scope suffix into the input stream to
complete the specification of the scoped rule. The scope lookahead symbol (string, if the
grammar is LR($k$), $k > 0$) is a terminal symbol (string) that may immediately follow the
prefix in a legal input. The left-hand side of the scope is the nonterminal on the left of the
scoped rule. The set $Q$ contains the states of the LR($k$) automaton in which the left-hand
side can be introduced through closure.

Given a scoped rule $A \to \alpha B \beta$, the scope information related to $B$ is computed as
follows. Since $\beta \not\Rightarrow^{*} \varepsilon$, there exists a string $\nu X \sigma$ such that $\beta \Rightarrow^{*} \nu X \sigma$, $\nu \Rightarrow^{*} \varepsilon$, and
$X \Rightarrow^{*}_{rm} a\omega$, for some string $\omega$. Let $\alpha B \nu X \sigma$ be the suitable string mentioned above; then
a valid scope for the rule $A \to \alpha B \beta$ is $(\alpha B \nu, X\sigma, a, A, Q)$, where $Q$ is the set of states in
the LR automaton containing a transition on $A$.

As an example, consider the if_stmt rule of Figure 4.7 and the scope induced by the
nonterminal st_list in its right-hand side. To put it in the form $A \to \alpha B \beta$, let $B$ be
the symbol "st_list". It follows that $\alpha$ is the string "if cond then", and $\beta$ is the string
"elsif_list opt_else end if ;". Let $\nu$ be the string "elsif_list opt_else" and let $X$ be the symbol
"end". One observes that $\beta$ is exactly in the desired form $\nu X \sigma$. Thus, assuming the set
of transition states $Q$ is available, the scope induced by st_list for the if_stmt rule is:

(if cond then st_list elsif_list opt_else, end if ;, end, if_stmt, Q)

The other recursive symbols in the if_stmt rule, elsif_list and opt_else, induce exactly the
same scope as st_list, since they are both nullable.
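The construction of a scope from a scoped rule can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it handles only the case where $\beta$ is literally of the form $\nu X \sigma$ (as in the if_stmt example) rather than the general case $\beta \Rightarrow^{*} \nu X \sigma$, and the `Scope` record, the `first_terminal` map (a terminal that can begin a string derived from a nonterminal $X$) and the precomputed state set are hypothetical inputs.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple, Set, Tuple


class Scope(NamedTuple):
    prefix: Tuple[str, ...]   # pi
    suffix: Tuple[str, ...]   # sigma
    lookahead: str            # a
    lhs: str                  # A
    states: FrozenSet[int]    # Q


def derive_scope(lhs: str, rhs: List[str], b_index: int,
                 nullable: Set[str], first_terminal: Dict[str, str],
                 states: Iterable[int]) -> Scope:
    """Scope induced by B = rhs[b_index] in the rule lhs -> rhs.

    Assumption: beta (the symbols after B) is used as written, so nu is the
    longest leading run of nullable symbols and X is the symbol after it.
    """
    beta = rhs[b_index + 1:]
    for k, sym in enumerate(beta):
        if sym not in nullable:
            nu, x, sigma = beta[:k], sym, beta[k + 1:]
            break
    else:
        raise ValueError("beta is nullable, so condition 2 does not hold")
    # For a terminal X the lookahead is X itself; otherwise consult the map.
    lookahead = first_terminal.get(x, x)
    return Scope(prefix=tuple(rhs[:b_index + 1] + nu),
                 suffix=tuple([x] + sigma),
                 lookahead=lookahead,
                 lhs=lhs,
                 states=frozenset(states))
```

With the if_stmt rule and nullable symbols elsif_list and opt_else, calling `derive_scope` with `b_index` set to the position of st_list, elsif_list or opt_else yields the same scope in all three cases, in agreement with the observation above.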

4.3.3.2      Scope  Error  Detection 

Given an error configuration:

$q_1, \ldots, q_m \mid t_1, \ldots, t_n$

and a set of scopes:

$\{(\pi_1, \sigma_1, a_1, A_1, Q_1), \ldots, (\pi_l, \sigma_l, a_l, A_l, Q_l)\}$,

the applicability of scope recovery to this configuration is determined as follows. For each
scope $(\pi_i, \sigma_i, a_i, A_i, Q_i)$, a three-step test is performed:

step 1: The lookahead_action function is invoked with $a_i$ as the current token to check
if $a_i$ is a valid lookahead symbol for the viable prefix. As a side-effect, this
function updates the state stack configuration (using a temporary stack) to reflect all
reduce actions, including empty reductions, induced by $a_i$. If the action returned by
lookahead_action is the error action then the whole test fails. Otherwise, assuming,
once again, that $m$ denotes the upper bound subscript of the updated state sequence,
step 2 is executed.

step 2: A pattern match is made between the prefix $\pi_i$ and the topmost $|\pi_i|$ symbols of
the viable prefix, i.e., the string obtained from the concatenation of the in_symbols
of the states $q_{m-|\pi_i|+1}, \ldots, q_m$. Again, if this test fails, the whole test fails. Otherwise
the final step is executed.

step 3: If $q_{m-|\pi_i|} \in Q_i$, then the test succeeds. Otherwise, the test fails.

If the three-step test is successful, then a parse check is performed on the
configuration $q_1, \ldots, q_{m-|\pi_i|}, q_{A_i} \mid t_1, \ldots, t_n$, where $q_{A_i}$ is the successor state of $q_{m-|\pi_i|}$ and $A_i$.¹ If the
parse_check function can parse at least min_distance symbols, the scope recovery is
successful. Otherwise, it is invoked recursively with the new configuration above and the
process is repeated until scope recovery either succeeds, or there are no more possibilities
to try.

When  scope  recovery  is  successful,  the  sequence  of  scopes  that  resulted  in  the  successful 
recovery  must  be  saved  for  the  issuance  of  an  accurate  diagnostic. 
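The three-step test can be sketched in isolation as follows. This Python fragment is only an illustration: the `Scope` record, the `in_sym` table and the shape of the `lookahead_action` callable (returning the action together with the updated state sequence) are assumptions made for the example, not the interface used by the method itself.

```python
from typing import Callable, Dict, FrozenSet, List, NamedTuple, Optional, Tuple


class Scope(NamedTuple):
    prefix: Tuple[str, ...]   # pi
    suffix: Tuple[str, ...]   # sigma
    lookahead: str            # a
    lhs: str                  # A
    states: FrozenSet[int]    # Q


def scope_applicable(
    stack: List[int],
    scope: Scope,
    in_sym: Dict[int, Optional[str]],
    lookahead_action: Callable[[List[int], str], Tuple[str, List[int]]],
) -> Optional[List[int]]:
    """Return q_1..q_{m-|pi|} if the scope passes the three-step test, else None."""
    # Step 1: the scope lookahead must be acceptable for the viable prefix.
    act, updated = lookahead_action(stack, scope.lookahead)
    if act == "error":
        return None
    n = len(scope.prefix)
    # A null prefix is not considered, and a state must remain below the prefix.
    if n == 0 or n >= len(updated):
        return None
    # Step 2: match the prefix against the in-symbols of the topmost n states.
    top_symbols = tuple(in_sym[q] for q in updated[len(updated) - n:])
    if top_symbols != scope.prefix:
        return None
    # Step 3: the state exposed below the prefix must belong to Q.
    if updated[len(updated) - n - 1] not in scope.states:
        return None
    return updated[:len(updated) - n]
```

On success, the caller would append the successor state on the scope's left-hand side to the returned sequence and parse check the resulting configuration.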

Figure 4.8 depicts a complete implementation of the scope error detection algorithm.
The algorithm mirrors the preceding discussion in a straightforward manner except for the
tests in line 12 and the code segment in lines 2 through 5. In line 12, the test for top > 0
ensures that the stack is longer than the scope prefix with which it is being matched and
the test for top < min(#stack, pos+1) prevents the algorithm from considering a scope
whose prefix would match a null string. This latter test also guarantees that each time a
scope is tested and applied it does not extend the length of the stack. The code segment
in lines 2 through 5 prevents the algorithm from visiting a given configuration more than
once. Since the application of a scope never extends the state sequence it started with,
one only needs to keep track of the states that have been entered at each index position
of the initial state sequence.

The emphasis in writing the code of Figure 4.8 was on the clarity of the exposition rather
than efficiency. In particular, note that stack should not be copied into scope_stack (as in
line 7) for each application of a scope. Instead, the attempt to match a scope prefix with
the viable prefix should be made using the relevant segment of the state sequence that is
in stack and the relevant segment that is in temp_stack. Since the application of a scope
never extends the stack, one can simply substitute stack for scope_stack in the recursive
call on line 23.


¹If the goto action on $A_i$ is a goto-reduce, the parser is simulated through the whole sequence of
goto-reduce actions that follow, until a goto action is encountered. This final goto is executed and the resulting
state sequence is used instead. Note that these actions do not consume any input symbol.

    -- Let scope_sequence and state_seen be global variables. scope_trial is invoked with the sequence
    -- of states: stack = q_1, ..., q_m in the error configuration. The input sequence t_1, ..., t_n is assumed
    -- to be global. Initially, scope_sequence = [ ] and state_seen = [∅ : i ∈ [1..m]].

 1  proc scope_trial(stack);
 2      if (q := stack(#stack)) ∈ state_seen(#stack) then
 3          return;
 4      end if;
 5      state_seen(#stack) := state_seen(#stack) ∪ {q};
 6      for each scope (π_i, σ_i, a_i, A_i, Q_i) loop
 7          scope_stack := stack;
 8          act := lookahead_action(scope_stack, a_i, pos);
 9          if act ≠ error then
10              scope_stack(pos+1..) := temp_stack(pos+1..);
11              top := #scope_stack - |π_i|;
12              if top > 0 and top < min(#stack, pos+1) then
13                  pref := [in_sym(scope_stack(j)) : j in top+1..#scope_stack];
14                  if pref = π_i and scope_stack(top) ∈ Q_i then
15                      until act is not a goto-reduce action loop
16                          top := top - RHS(act) + 1;
17                          act := GOTO(scope_stack(top), LHS(act));
18                      end loop;
19                      if parse_check(scope_stack(1..top)+[act], t_1, ..., t_n) ≥ min_distance then
20                          scope_sequence := [i];
21                          return;
22                      else
23                          scope_trial(scope_stack);
24                          if scope_sequence ≠ [ ] then
25                              scope_sequence := scope_sequence + [i];
26                          end if;
27                          return;
28                      end if;
29                  end if;
30              end if;
31          end if;
32      end loop;
33  end scope_trial;

Figure 4.8: scope_trial procedure

4.3.3.3     User-Supplied Scopes

This scope recovery method can be extended to accommodate any kind of multiple insertion
of symbols by allowing users to specify their own "scopes". Scopes are explicitly specified
with productions that use the special error symbol, %error, as a right-hand side marker to
separate the scope prefix from its suffix. These productions are called user-supplied scoped
rules and the scopes derived from them are called user-supplied scopes.

The  parser  generator  upon  scanning  a  rule  with  %error  in  its  right-hand  side  identifies 
such  a  rule  as  a  scope  whose  prefix  consists  of  the  symbols  preceding  %error  and  whose 
suffix  consists  of  the  symbols  following  it.  The  symbol  %error  is  used  as  the  lookahead 
for  the  scope  but  it  is  neither  included  in  the  prefix  nor  the  suffix.  Therefore,  a  single  rule 
should  never  be  used  to  specify  more  than  one  user-supplied  scope  because  that  will  cause 
the  suffix  of  some  scopes  to  contain  the  special  %error  symbol. 

When  specifying  scopes,  special  attention  must  be  paid  to  the  kinds  of  symbols  that  are 
likely  to  be  omitted  by  a  programmer.  For  example,  consider  an  Ada  record_type_definition 
and  the  context  in  which  it  is  used: 

full_type_declaration   →  type identifier [discriminant_part] is type_definition ;
type_definition         →  ... | record_type_definition | ...
record_type_definition  →  record
                               component_list
                           end record

Even though the above rule for record_type_definition is terminated by an "end record"
sequence, it is not a scope, because in Ada, no string containing record_type_definition can
be derived from component_list. However, simply duplicating the record_type_definition rule
to identify "end record" as a scope closer is unlikely to be effective since a programmer
who omits this sequence would, most likely, also omit the terminating semicolon. In order
to identify the complete closing sequence one must respecify the whole full_type_declaration
for a record_type_definition as follows:

full_type_declaration  →  type identifier [discriminant_part] is
                              record
                                  component_list
                                  %error
                              end record ;

During normal parsing, user-supplied scoped rules are never reduced except when used
for scope recovery because %error is only used internally by the error recovery procedure and
is never recognized as an input symbol by the lexical analyzer. When detecting scope
errors  and  applying  scope  recovery  at  run  time,  user-supplied  scopes  are  treated  in  the 
same  manner  as  the  automatic  ones. 
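The way a parser generator might split a user-supplied scoped rule at the %error marker can be sketched as below. This is a minimal Python illustration: the function name, the dictionary layout and the decision to reject rules with zero or several markers are assumptions made here.

```python
from typing import Dict, List, Union

ERROR_MARKER = "%error"


def split_user_scope(lhs: str, rhs: List[str]) -> Dict[str, Union[str, List[str]]]:
    """Split a user-supplied scoped rule into scope prefix and scope suffix.

    The marker itself serves as the scope lookahead and belongs to neither part.
    """
    if rhs.count(ERROR_MARKER) != 1:
        # A rule should specify exactly one user-supplied scope.
        raise ValueError("expected exactly one %error marker in the rule")
    k = rhs.index(ERROR_MARKER)
    return {
        "lhs": lhs,
        "prefix": rhs[:k],
        "suffix": rhs[k + 1:],
        "lookahead": ERROR_MARKER,
    }
```

For the full_type_declaration rule above, the suffix obtained this way is the complete closing sequence "end record ;", which is what the user would be advised to insert.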

4.4     Recovery  Phases 

This section describes how the different error recovery strategies discussed in the previous
sections are incorporated into the unified two-phase scheme of this method. Error
diagnosis, error repair and a suitable data structure for implementing and managing the input
stream are also discussed.


It is assumed that the driver with 3 deferred tokens of Figure 4.5 is used. This driver can
detect an error either on curtok or on the successor of curtok, which will be denoted succtok.
When the parsing starts (or restarts after a recovery), if curtok is in error, the parser stops
immediately and invokes the error recovery routine with a single configuration, namely,
the configuration prior to the execution of any action on curtok. The state sequence of this
error configuration is contained in state_stack. If the parser can successfully process curtok
but fails on succtok, the error recovery routine has access to both the curtok configuration
and the configuration prior to the execution of any action on succtok. The state sequence
of this configuration is contained in next_stack. If the parser is able to parse at least 2
tokens successfully before an error is detected, the error is detected on succtok, and in
addition to the curtok and succtok configurations, the configuration of the parser prior
to the execution of any action on prevtok is also available. The state sequence of this
configuration is contained in previous_stack.

At  the  global  level,  the  effectiveness  of  a  recovery  trial  is  measured  based  on  two  criteria: 

•  the  number  of  symbols  that  must  be  deleted  if  the  repair  in  question  is  applied 

•  the parse_check distance of the recovery

The  primary  phase  recovery  which  includes  all  recovery  trials  that  are  based  on  at 
most  a  single  input  token  modification  is  attempted  first.  If  a  successful  primary  phase 
recovery  is  found  that  cannot  be  beaten  by  any  other  recovery  in  terms  of  the  criteria 
above,  it  is  accepted.  If  such  a  primary  phase  recovery  is  not  found,  secondary  phase 
recovery  is  attempted.  If  a  successful  secondary  phase  recovery  is  found,  then  it  is  accepted. 
Otherwise,  the  error  recovery  gets  into  a  form  of  "panic  mode",  where  the  current  input 
buffer  is  flushed,  new  input  tokens  are  read  in  and  secondary  phase  recovery  is  attempted 
again.  This  process  is  repeated  until  either  a  successful  phrase-level  recovery  is  obtained 
or  the  end  of  the  input  stream  is  reached. 
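The overall control flow of the two phases can be summarized by the sketch below. It is written in Python under simplifying assumptions: the three callables are hypothetical stand-ins for the primary phase, the secondary phase and the buffer flush, and the comparison of a primary recovery against competing recoveries is assumed to happen inside `primary`.

```python
from typing import Callable, Optional, TypeVar

R = TypeVar("R")


def recover(primary: Callable[[], Optional[R]],
            secondary: Callable[[], Optional[R]],
            refill_buffer: Callable[[], bool]) -> Optional[R]:
    """Two-phase recovery with a panic-mode loop.

    primary and secondary return a repair, or None on failure.
    refill_buffer flushes the input buffer and reads new tokens; it returns
    False once the end of the input stream has been reached.
    """
    repair = primary()
    if repair is not None:
        return repair
    while True:
        repair = secondary()
        if repair is not None:
            return repair
        # Panic mode: discard the current buffer contents and try again.
        if not refill_buffer():
            return None
```

The loop terminates either with a repair or with `None` when the input is exhausted, which corresponds to the two exit conditions given above.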

When the error recovery procedure finds a successful recovery, it issues a diagnosis for
it and repairs the configuration before returning control to the parser.

4.4.1      Primary  Phase 

In the primary phase, error recovery is applied on each available configuration, starting
with next_stack, proceeding with state_stack and finally processing previous_stack.

For each configuration, scope recovery is attempted first, followed by simple recovery.
The same criteria used in choosing a simple recovery are used in the primary phase. The
misspelling index of a scope recovery trial is set to 1.0. Thus, for a given configuration,
a successful scope recovery always has priority over a simple recovery trial that yields the
same parse_check distance.

If a successful recovery is obtained from the primary phase and its stack configuration
is next_stack or state_stack, the recovery trial is evaluated against certain phrase-level
recovery trials on the stack configuration in question before being accepted. The phrase-level
recovery trials in question are the ones whose repair actions would have as little impact
on the recovery configuration as a simple recovery. They are misplacement recovery trials
and scope recovery trials that require the deletion of a single input token. The idea is to
ensure that none of these borderline recoveries can be more effective than the best primary
phase recovery.

4.4.2  Secondary  Phase 

In the secondary phase, phrase-level recovery is applied first on the next_stack configuration
if it is available and then on the state_stack configuration. Phrase-level recovery is never
attempted on the previous_stack configuration, because it is not always possible to issue
an accurate diagnosis for a recovery found in that configuration (see Section 4.4.3) or to
properly repair that configuration (see Section 4.4.5).

If a successful phrase-level recovery is obtained, a check is made to see if the error can
be better repaired by the closing of some scopes followed by less radical surgery. Consider
the following Pascal example:

    if count[listdata[sub] := 0 then
        x := (( 3 ]];

In the first line, the programmer is missing a closing "]" and the assignment operator
":=" is used instead of a relational operator. This error is detected on the symbol ":=".
In the second line, the programmer used two "]" instead of ")" to close a parenthesized
expression and the error is detected on the first "]". Nothing short of a phrase deletion
of the sequence "[listdata[sub] := 0" in the first instance and a phrase substitution of
"expression" for the sequence "(( 3 ]]" would successfully repair these errors. However,
it is not difficult to see that they can be repaired more accurately using scope recovery,
by proceeding as follows. Before accepting a phrase-level recovery based on an error
phrase $\beta \backslash x$, a scope recovery check is performed on the recovery configuration, followed by
the deletion of up to $|x|$ tokens in the right context. If the scope recovery is successful,
then its associated repair actions are applied without the subsequent deletion and the
secondary phase returns successfully. The parser fails right away and once again invokes
the error recovery procedure. On this next round, primary and secondary phase recovery
are attempted again. This subsequent attempt will at best fix the remaining input or
at worst delete $x$ from the input. In the example above, the missing "]" is inserted and
"relational_operator" is substituted for ":=" in the first line. In the second line, two closing
")" are inserted, followed by a deletion of the pair "]]" (see Figure 4.2).

4.4.3  Error  Diagnosis 

In order to accurately diagnose an error, one must identify the location of the tokens
that are in error. This is straightforward for a simple recovery since such a recovery
involves the modification of one or two input tokens and the location of each token is
available. Recall that each state in the state stack is associated with the location of the
first token on which an action was executed in that state. Thus, given an error
configuration $q_1, \ldots, q_m \mid t_1, \ldots, t_n$, if a successful phrase-level recovery based on an error phrase
$q_{i+1}, \ldots, q_m \backslash t_1, \ldots, t_{j-1}$ of that configuration is found, the location of the first token of
this error phrase is the location associated with the recovery state $q_i$. If the state sequence
associated with the error configuration is state_stack then the symbol $t_1$ is curtok. In that
case, if $j = 1$, $t_0$ is prevtok. Similarly, if the state sequence is next_stack and $j = 1$ then
$t_1$ = succtok and $t_0$ = curtok. Finally, if the state sequence is previous_stack and $j = 1$ then
$t_1$ = prevtok and $t_0$ is undefined. Therefore, as mentioned earlier, if phrase-level
recovery is attempted on the previous_stack configuration, one cannot identify the last symbol
of an error phrase that does not contain any input symbol unless the predecessor of the
previous token is also kept. When a scope recovery associated with a scope $(\pi, \sigma, a, A, Q)$
is applicable on an error configuration like the one above, it can be viewed as a secondary
substitution of the nonterminal $A$ for the error phrase $q_{i+1}, \ldots, q_m \backslash \varepsilon$, where $|\pi| = m - i$.
That is, the location of the first token of this error phrase is the location associated with $q_i$
and the location of the last token is the location of $t_0$ as described above. (Note that this
implies that the location of the last symbol in a scope error phrase of the previous_stack
configuration cannot be accurately determined.)

Once the location of the error token or error phrase has been identified, an error
diagnosis message is issued describing the repair. The diagnosis of a simple recovery repair
is straightforward except that for an insertion or substitution some preprocessing of the
repair token is required before the message is emitted. This will be discussed later. To
diagnose a phrase deletion, the user is advised to delete the symbols in the error phrase in
question. For a phrase substitution, if the relevant reduction goal is a nullable nonterminal,
the diagnosis is treated like a phrase deletion. Otherwise, the reduction goal is suggested
as a replacement for the error phrase. To diagnose a scope recovery, the user is advised to
insert the symbols of the scope suffix in question after $t_0$, if the location of $t_0$ is defined, or
before $t_1$, otherwise. In addition, it is very useful to identify the starting location of the
scope, which can be computed from the recovery state.

The main goal of an error message is to inform the programmer as to how the input
source was modified. However, it is also desirable to issue error messages that give the
user an accurate diagnosis of the error. In particular, whenever a symbol $X$ is inserted
into the text or it is substituted for an error token (by simple recovery) or for an error
phrase (by phrase-level recovery), the symbol reported in the message is the highest-level
symbol that can subsume $X$. That is, a symbol $Y$ is used instead of $X$ if, assuming $\alpha$
is the viable prefix corresponding to the state sequence of the error configuration and $w$
is the remaining input, $\alpha Y w \Rightarrow^{*}_{rm} \alpha X w$ and $\forall B \in N$ such that $B \neq Y$, $\alpha B w \not\Rightarrow^{+}_{rm} \alpha Y w$.
Consider the following erroneous Pascal statement:

    if n <= then POWER := else POWER := 2;

This statement is missing a subexpression after "<=" and an expression after the first
":=". Since simple recovery is attempted first on terminal symbols and any of the symbols
identifier, NIL, string_literal, integer_literal or real_literal can be reduced to a subexpression,
one such symbol will be inserted into the text to repair each of these two errors. Depending
on the semantic context, the arbitrary insertion of a symbol can be misleading even though
it is syntactically correct. For example, it is clear from the example above that POWER is
not a pointer variable. Nonetheless, NIL is a valid candidate that may be inserted after
the first ":=". In reporting such an error, if the highest-level symbol associated with the
repair candidate is used, simple_expression will be suggested as an insertion candidate after
"<=" and expression will be suggested as an insertion candidate after the first ":=". (See
Pascal BNF definition in [11].)

The computation of the highest-level symbol is straightforward. Starting in the recovery
state one simply simulates the steps of the parser until a state $q$ is entered where the
candidate $X$ can be shifted. Next, starting with state $q$ as the only state in a stack and
using the next input symbol $t$ as lookahead, $X$ is shifted and all reduce actions induced
by $t$ and their associated goto actions are applied until the parser wants to shift $t$ or
reduce below $q$. At that point, the last symbol on which a transition was made in $q$ is the
highest-level symbol that subsumes $X$. Figure 4.9 is an implementation of this algorithm.

function highest_symbol(stack, X, t);
    hs := X;
    q := stack(#stack);
    if X ∈ N then
        act := GOTO(q, X);
    elseif (act := ACTION(q, X)) is not a shift or shift-reduce action then
        act := lookahead_action(stack, X, pos);
        q := temp_stack(#temp_stack);
    end if;
    temp_stack := [q];
    if act is a shift or goto action then
        temp_stack := temp_stack + [act];
        act := ACTION(act, t);
    end if;
    top := 1;
    while act is a reduce action loop
        until act is not a goto-reduce action loop
            top := top - RHS(act) + 1;
            if top < 1 then
                return hs;
            elseif top = 1 then
                hs := LHS(act);
            end if;
            act := GOTO(temp_stack(top), LHS(act));
        end loop;
        temp_stack(top+1) := act;
        act := ACTION(act, t);
    end loop;
    return hs;
end highest_symbol;

Figure 4.9: highest_symbol function

4.4.4     Error  Repair 

A repair is applied by resetting the components of the main configuration: the input buffer
and state_stack.

The  resetting  of  the  input  buffer  involves  the  insertion  of  some  symbols  into  the  buffer, 
the  reading  of  new  input  tokens  into  the  buffer,  or  the  replacement  of  some  buffer  elements. 

Figure 4.10: Initial buffer configuration


A  good  scheme  for  managing  the  input  buffer  is  discussed  in  the  next  section. 

To reset state_stack, the first step is to initialize it with the proper state sequence if
the successful recovery was found on previous_stack or next_stack. For a simple recovery,
nothing more is required. For a phrase-level recovery, all states following the recovery state
are removed from the stack. Similarly, for a scope recovery, the sequence of states on top
of the stack that corresponds to the prefix of the scope in question is removed and the
repair proceeds as if the error were a simple insertion of the left-hand side of the scope.

4.4.5     The  Input  Buffer 

The input buffer can be implemented as a fixed-size circular queue containing the previous
token and the next n tokens to be processed, for some fixed integer n. prevtok is a pointer
variable that identifies the previous token element in the circular queue. Its successor in
the queue is identified by curtok, which is also a pointer variable. Initially, the prevtok
element is undefined and the next n tokens from the input are read into the remaining
elements. See Figure 4.10.

If the input is correct, the buffer is processed as follows. Each time the parser executes
a shift (or shift-reduce) action, the next input token is read into the prevtok element,
prevtok is set to curtok and curtok is updated to point to its successor. This process
continues until the parse terminates. (When the end of the input source is reached, the
lexical analyser is expected to keep returning ⊥.)

Note how this scheme easily accommodates the general case of an LALR(k) parser with
variable lookahead strings described in the previous chapter. If the parser needs to perform
more lookahead after having consulted a given element of the buffer, it obtains the next
lookahead symbol by simply moving to the next element in the queue. Of course, this
implies that the size of the input buffer queue must be greater than or equal to k.
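
The buffer discipline just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `InputBuffer`, the methods `peek` and `shift`, and the `EOF` sentinel (standing in for the end-of-input symbol) are hypothetical names that do not appear in the original.

```python
from typing import Iterator, List, Optional

EOF = "EOF"  # hypothetical stand-in for the end-of-input symbol


class InputBuffer:
    """Fixed-size circular queue: one previous-token slot plus n lookahead slots."""

    def __init__(self, tokens: Iterator[str], n: int) -> None:
        self._tokens = tokens
        self.size = n + 1
        self.slots: List[Optional[str]] = [None] * self.size
        self.prevtok = 0   # previous-token element, initially undefined
        self.curtok = 1    # successor of prevtok in the queue
        for i in range(1, self.size):
            self.slots[i] = self._read()

    def _read(self) -> str:
        # Once the source is exhausted, keep returning EOF.
        return next(self._tokens, EOF)

    def peek(self, k: int = 0) -> str:
        """Return the k-th lookahead token (k = 0 is the current token)."""
        if not 0 <= k < self.size - 1:
            raise IndexError("lookahead exceeds buffer capacity")
        return self.slots[(self.curtok + k) % self.size]

    def shift(self) -> None:
        """Consume the current token and refill the slot that is freed."""
        self.slots[self.prevtok] = self._read()  # overwrite the old previous token
        self.prevtok = self.curtok
        self.curtok = (self.curtok + 1) % self.size
```

Because `peek(k)` only walks forward from `curtok`, extra lookahead costs one index computation per symbol, and the capacity check mirrors the requirement that the queue hold at least k tokens.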

When an error is encountered, the reparation of the configuration usually involves
some modification of the input buffer. In particular, it may involve the insertion of some
symbols. If a repair calls for the insertion of a symbol in front of curtok or prevtok, the
buffer overflows. To accommodate such a repair, two extra overflow elements are needed.
They are identified by two pointers: overflow1 and overflow2. Let's consider the worst
case, where the buffer contains a sequence t0, t1, t2, ..., where t0 is the previous token,
and the repair modification is to insert a symbol X in front of t0. In that case, the previous
token t0 is copied into the overflow2 element, whose successor is set to be curtok; the new
symbol X is placed in the overflow1 element, whose successor is set to be overflow2; and
curtok is set to overflow1. Figure 4.11 illustrates this case.

Figure 4.11: Buffer configuration after insertion of X in front of t0
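
The worst-case insertion can be sketched as follows, assuming the queue is modelled as two dictionaries: `slots` (element to token) and `succ` (element to successor). The function names and the dictionary model are illustrative assumptions; only the names `overflow1`, `overflow2`, `prevtok` and `curtok` come from the text.

```python
from typing import Dict, Hashable, List


def insert_before_previous(slots: Dict[Hashable, str],
                           succ: Dict[Hashable, Hashable],
                           prevtok: Hashable,
                           curtok: Hashable,
                           symbol: str) -> Hashable:
    """Insert `symbol` in front of the previous token; return the new curtok."""
    slots["overflow2"] = slots[prevtok]   # copy the previous token t0
    succ["overflow2"] = curtok            # t0 is followed by the old current token
    slots["overflow1"] = symbol           # the inserted symbol X
    succ["overflow1"] = "overflow2"
    return "overflow1"


def upcoming(slots: Dict[Hashable, str],
             succ: Dict[Hashable, Hashable],
             start: Hashable,
             count: int) -> List[str]:
    """List the next `count` tokens the parser would see, starting at `start`."""
    out: List[str] = []
    cell = start
    for _ in range(count):
        out.append(slots[cell])
        cell = succ[cell]
    return out
```

With a three-element queue holding t0, t1, t2 (t0 being the previous token), inserting X yields the token order X, t0, t1, t2 without moving any element of the circular queue itself.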

When executing a shift action on an overflow element, the parser simply updates curtok
to point to its successor and no new tokens are read in. When the parser reenters the queue,
the processing continues as before, and the next shift action updates the previous pointer.
Note that this approach assumes that no error will occur on an overflow element. For
normal recoveries, this can be guaranteed by min_distance being set to a value greater than
or equal to 2. However, this restriction is the main reason why secondary phase
recovery is not attempted on the previous_stack configuration. Recall from section 4.4.2
that in some cases, instead of applying a phrase-level recovery, a scope recovery is applied
and control is returned to the parser with a configuration that will declare an error on
the next input symbol. In such a case, if previous_stack is the state sequence that is used,
then after the repair is applied, the input buffer is placed in a configuration similar to the
configuration of Figure 4.11, where the symbol X is the left-hand side of the scope. Thus,
after processing X, the parser will detect an error on an overflow element.

There are other repair cases where one or both overflow elements are needed in the
resetting of the input buffer. The detail of these cases is left to the reader.

4.5      Remarks

This error recovery method has been successfully implemented. Parsers were built for
Ada and Pascal and tested on the Ada examples of [37] and the Pascal examples of [20].
(Appendix E shows some of these results.) Pennello and DeRemer [17] proposed that the
quality of a repair be rated "excellent" if it repaired the text as a human reader would
have, "good" if it resulted in a reasonable program and no spurious errors, and "poor" if
it resulted in one or more spurious errors. Based on these categories, the performance of
this method on the test set of [20] was 85.9% excellent, 14.1% good and 0.0% poor. In
fact, most of the "good" recoveries resulted from errors whose repair required some kind of
semantic judgement. The time performance of this method is excellent, usually requiring
less than 50 milliseconds per error on a 16 MHz PS/2 model 80.

Chapter 5

Table Optimization and Generation


In general, one of the main objectives that must be achieved in generating parsing tables
is that they must be time- and space-efficient. In addition, if the resulting parser must
recover from input errors, the tables generated must retain enough important information
about the input grammar and its LR(0) automaton to accommodate error recovery.
Unfortunately, these two goals are sometimes conflicting. For example, as mentioned earlier,
when space-saving optimizations such as default reductions (and others that will be
described in this chapter) are used, the resulting parsing tables lose certain information about
the automaton. Space-saving optimizations are necessary in order to reduce the parsing
tables to an acceptable size, but the information they lose (such as the viable candidates
in a given state) is usually important to ensure good recoveries.

In the past, authors who have studied LR error recovery have proposed that some
tradeoffs be made regarding space-efficiency, time-efficiency and quality of recoveries. For
example, in [21], it is suggested that default reductions not be used in certain states to
prevent premature reductions in these states and that only terminal symbols on which
transitions are defined in a given state be considered as simple recovery candidates (this
information must be available in the parsing tables). In [37], a deferred parsing technique
using two parsers is used to avoid premature reductions and all terminal symbols are
considered as possible candidates during simple recovery; in effect, time-efficiency is sacrificed
in order to gain space-efficiency and obtain good recoveries.

By contrast, the approach taken in this method is to treat the issue of table compaction
and optimization for parsing separately from the issue of table compaction and optimization
for error recovery. The data for each activity is aggressively optimized and compacted,
separately. The result is that the amount of space used for each of these two sets of tables
is very small (usually, the total amount of space used is much smaller than it would be
for a single set of parsing tables in other methods). Nonetheless, these tables are very
time-efficient and no useful error recovery information is lost.

5.1      Parsing Tables

As described earlier, an LALR(k) parser with varying-length lookahead strings can be
represented by a pair of parsing tables in matrix form: ACTION and GOTO. Each row of
a parsing table is labeled with a distinct state of the parser. Each column of ACTION is
labeled with a terminal symbol and each column of GOTO is labeled with a nonterminal
symbol. The entry in a parsing table for a given row and column is the parsing action
associated with the corresponding state-symbol pair. In essence, the matrix is used to
store the characteristics of the pushdown automaton that obeys the grammatical rules
of the language. In terms of speed performance, the matrix is an efficient method for
representing an LR parsing table. Unfortunately, such a matrix is usually very sparse,
requiring an excessive amount of space relative to the number of useful entries it contains.
For example, an LALR(1) parser for the Ada language with 541 states and 403 symbols
contained only 4151 useful entries, thus utilizing only 1.9% of the matrix.

An LR automaton can also be represented as a directed graph, where each vertex
of the graph represents a state of the automaton and each edge, labeled with a symbol,
represents a transition between states. (In this representation, one must be able to
differentiate between lookahead states and regular states. In addition, special reduce states
(called "lookahead states" in [6]) and shift-reduce states must also be used for reduce and
read-reduce actions, respectively.) This representation has the advantage of compactness
since it need only be as large, proportionally, as the number of significant entries in the
corresponding matrices. However, the computation of an action using this representation
is slow since it may require that each out-edge connected to a state be explored.

From this discussion, one gathers that a desirable goal in generating LR parsing tables
is to find a representation that performs as well as the matrix in terms of time-efficiency
but whose space requirement is close to that of the graph representation.

A number of methods are known in the art for the compression of LR parsing tables.
The compression can be achieved, for example, by the use of hashing, linear lists [33],
row-displacement [33, 30, 23, 15], or graph-coloring [30].

Hashing  is  a  standard  technique  for  storing  and  searching  large  sparse  tables,  but  it  is 
seldom  used  for  LR  parsing  applications  because  of  its  poor  worst-case  performance.  It 
also  consumes  a  large  amount  of  space. 

Substantial space savings result when the significant entries in each row of a matrix
are stored in a linear list. The list, however, must be searched sequentially when a parse
action is needed. Therefore, the time required to determine a parse action is not constant,
but depends on the number of significant entries in the state in question. This method,
discussed in [33], does save space but at the expense of time.

In the row-displacement method, the rows of a sparse matrix are overlaid on each other
in a one-dimensional table and an auxiliary table is used to retrieve the starting index of
each row in the overlay table. Each entry in the overlay table contains a check field that
is used to verify whether or not that entry corresponds to the original useful entry in the
matrix. This method is advocated by Aho, Sethi and Ullman in [33] and by Ziegler in [15],
and is discussed in detail in [30] and [23]. It is also used together with list searching in
the YACC parser generator, as described in [33]. The row-displacement method does well
with respect to time-efficiency, but its space utilization is not always optimal unless the
matrix in question has a "harmonic decay" property [23] and a first-fit decreasing heuristic
is used.

In the GCS (graph-coloring scheme) method proposed by [30], the rows of the original
matrix are partitioned into classes of rows that do not have different significant entries in
any column position. These rows are then merged to form a shorter matrix. Next, the
columns of this new matrix are partitioned and merged, using the same method, to obtain a
final reduced matrix. A vector rowmap indexable by the row indexes of the original matrix
and a vector columnmap indexable by the column indexes of the original matrix are used to
map the row and column indexes of the original matrix into the row and column indexes of
the reduced matrix. In addition, when this method is used to compress the ACTION table
of an LR parser, a boolean matrix sigmap, indexable by the indexes of the original matrix,
is required to validate entries in the reduced matrix. If each boolean value in sigmap is
stored in a single bit, this method yields relatively small parse tables where actions can
be computed with a constant number of operations. However, since accessing a bit element
of a matrix is usually a complex operation on most machines and sigmap must be accessed
for each terminal action, the time performance of this method may suffer.
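
A GCS lookup can be sketched as below. The names `rowmap`, `columnmap` and `sigmap` follow the text; the function name `gcs_lookup`, the `ERROR` constant and the tiny hand-built tables are assumptions made purely for illustration.

```python
from typing import List

ERROR = 0  # hypothetical stand-in for the error action


def gcs_lookup(reduced: List[List[int]], rowmap: List[int], columnmap: List[int],
               sigmap: List[List[bool]], i: int, j: int) -> int:
    """Return the action for original entry (i, j) of a GCS-compressed table."""
    if not sigmap[i][j]:
        # The merged cell may hold another row's action, so validate first.
        return ERROR
    return reduced[rowmap[i]][columnmap[j]]


# Original 3 x 3 matrix (0 = error). No two rows disagree on a significant
# entry, so all three rows merge into one; the columns cannot merge further
# because they hold different significant entries in the merged row.
original = [[1, 0, 0],
            [0, 2, 0],
            [1, 0, 3]]
reduced = [[1, 2, 3]]
rowmap = [0, 0, 0]
columnmap = [0, 1, 2]
sigmap = [[v != ERROR for v in row] for row in original]
```

Without the `sigmap` test, entry (0, 1) would wrongly return the action 2 that belongs to row 1, which is why the bit matrix has to be consulted on every terminal action.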

In  the  method  presented  here,  the  compaction  of  LR  parsing  tables  is  performed  in 
three  steps.  In  the  first  step,  a  number  of  transformations  are  applied  on  the  original 
parsing  tables  to  significantly  reduce  the  number  of  rows  and  useful  entries  in  them.  (The 
resulting  tables  usually  have  the  harmonic  decay  property.)  In  the  second  step,  row- 
displacement  is  used  to  compact  the  resulting  transformed  matrices.  The  final  step  is  a 
clean-up  step  where  the  actions  in  the  compressed  table  are  updated  and  all  auxiliary 
tables  are  eliminated.  This  approach  often  yields  perfectly  compacted  tables  that  require 
a  fixed  number  of  primitive  operations  on  integers  to  compute  an  action. 

5.1.1      The  Row-Displacement  Scheme 

Let A be a sparse matrix with m1 rows, m2 columns and n useful entries. Such a matrix
A is usually stored in row-major order as a one-dimensional array A' containing m1 × m2
elements. In this representation, each element A(i, j) of the matrix corresponds to the
element A'((i - 1) * m2 + j) of the one-dimensional array. Let row be an auxiliary
one-dimensional array of m1 elements, each containing the offset of its corresponding row
in the matrix; i.e., row(i) = (i - 1) * m2, 1 ≤ i ≤ m1. Using row, an element A(i, j)
corresponds to the element A'(row(i) + j).

The row-displacement scheme is a method for compressing a sparse matrix A into two
parallel one-dimensional arrays, CHECK and INFO, with fewer positions than A'. Given
a pair (i, j), the CHECK array is used to test whether or not that pair is associated with
a useful entry in A. If it is, the relevant information is retrieved from the INFO array.

The compaction is performed by overlapping the rows of A and placing their values
in INFO in such a way that no two useful entries end up in the same position. The
algorithm can be stated more formally as follows. For each row i in A, a sequential
search is performed on the elements of INFO until a set of valid positions is found that can
accommodate the useful entries of row i. Next, the i-th entry of the row vector is initialized
with the displacement of row i in INFO; each useful entry A(i, j) is placed in its assigned
location, INFO(row(i) + j), and the corresponding CHECK(row(i) + j) element is set to i
to indicate that the element row(i) + j of INFO is an element of row i in A. This algorithm
is known as the "first-fit" method. (It is NP-complete to find a set of displacements that
minimizes the size of the compressed vector [23], [16].)

Figure 5.1: A matrix A and its compressed representation

Figure 5.2: A finite state automaton and its compressed representation

Figure 5.1 depicts a matrix A and a possible compressed representation for that matrix.
Non-useful entries in A contain the value 0. The columns of A are labeled with the sequence
of letters [a,b,c,d,e] to avoid confusion. These indexes can easily be mapped into the
integers [0..4], as the row-displacement scheme assumes that the indexes of the matrix are
integers. To look up an element A(i, j), one checks whether or not CHECK(row(i) + j) = i.
If the test succeeds then INFO(row(i) + j) is the relevant value of A(i, j); otherwise, A(i, j)
is not a useful entry and 0 is the relevant value.
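
The first-fit compaction and the lookup test can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes 0-based row and column indexes, uses -1 to mark an unused CHECK position, and the function names `compress` and `lookup` are hypothetical.

```python
from typing import List, Tuple


def compress(matrix: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """First-fit row displacement; 0 marks a non-useful entry.

    Returns (row, check, info) with 0-based indexes; unused CHECK cells hold -1.
    """
    ncols = len(matrix[0])
    row: List[int] = []
    check: List[int] = []
    info: List[int] = []
    for i, values in enumerate(matrix):
        useful = [j for j, v in enumerate(values) if v != 0]
        d = 0
        # Slide the row right until none of its useful entries collides.
        while any(d + j < len(check) and check[d + j] != -1 for j in useful):
            d += 1
        # Grow both vectors so that every column of this row can be probed.
        grow = d + ncols - len(check)
        if grow > 0:
            check.extend([-1] * grow)
            info.extend([0] * grow)
        for j in useful:
            check[d + j] = i
            info[d + j] = values[j]
        row.append(d)
    return row, check, info


def lookup(row: List[int], check: List[int], info: List[int], i: int, j: int) -> int:
    """Return A(i, j), or 0 when the probed cell belongs to another row."""
    pos = row[i] + j
    return info[pos] if check[pos] == i else 0
```

For a 4 x 5 matrix with six useful entries such as `[[0,3,0,0,1],[2,0,0,0,0],[0,0,4,0,0],[5,0,0,6,0]]`, the sketch packs the rows into vectors of length 8 instead of 20, and every `lookup` costs one addition, one comparison and at most two array accesses.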

Note that if each row of a matrix has a unique displacement in its compressed
representation then the column indexes of the matrix can be used as check values instead of the
row indexes. This is particularly useful if m2 ≪ m1 (say m2 < 256), so that a column index
can be stored in a smaller storage unit than a row index. Note also that the lower bound of
the compressed vector does not have to be 1, but instead can be an arbitrarily chosen integer.

Assume that the matrix A of Figure 5.1 is a table representation of the transitions of a
finite state automaton (Figure 5.2(a)). Assume that state 1 is the root state and, as before,
the edge labels [a,b,c,d,e] are mapped into the integers [0..4]. In such a case, as was stated
earlier, the direct access capability of the matrix does not need to be preserved and the
auxiliary displacement vector can be eliminated by replacing each useful entry in INFO
by its displacement value. See Figure 5.2(b). Of course, in such a case the displacement
of the starting state must still be stored in an auxiliary Start variable.

Ziegler suggested that the rows of a matrix to be compacted with the row-displacement
scheme be sorted in decreasing order by the number of useful entries they contain before
applying the compaction algorithm. This approach is known as the "first-fit decreasing"
method. Tarjan and Yao [23] showed that this approach always produces a perfectly
compacted vector "if the distribution of nonzeros among the rows are reasonably uniform".
Their analysis was stated as follows:

Our intuition is that rows with only a few nonzeros do not block too many
possible displacement values for other rows; it is only the rows with many
nonzeros that cause problems. To quantify this phenomenon, we define n(l),
for l ≥ 0, to be the total number of nonzeros in rows with more than l nonzeros.
Our first theorem shows that if n(l)/n decreases fast enough as l increases, then
the first-fit decreasing method works well.

Theorem 5.1.1 Suppose the array A (to be compressed) has the following
"harmonic decay" property:

H: For any l, n(l) ≤ n/(l + 1).

Then every row displacement row(i) computed for A by the first-fit decreasing
method satisfies 0 ≤ row(i) ≤ n.

... If A has harmonic decay, at least half the nonzeros in A must be in rows with
only a single nonzero. In addition, no row can have more than √n nonzeros.
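
Property H is easy to test mechanically. The sketch below is an illustration only; the function names `n_of` and `has_harmonic_decay` and the representation of a matrix by its per-row nonzero counts are assumptions, not part of the quoted analysis.

```python
from typing import List


def n_of(row_counts: List[int], l: int) -> int:
    """n(l): total number of nonzeros in rows with more than l nonzeros."""
    return sum(c for c in row_counts if c > l)


def has_harmonic_decay(row_counts: List[int]) -> bool:
    """Check n(l) <= n / (l + 1) for every l >= 0.

    row_counts[i] is the number of nonzeros in row i. For l at or above the
    largest count, n(l) is 0, so only smaller values of l need testing.
    """
    n = sum(row_counts)
    for l in range(max(row_counts, default=0) + 1):
        if n_of(row_counts, l) * (l + 1) > n:   # integer form of n(l) > n/(l+1)
            return False
    return True
```

For example, row counts `[1, 1, 1, 1, 2]` satisfy H (n = 6, n(1) = 2 ≤ 3), whereas `[3, 1]` does not (n = 4, n(1) = 3 > 2), consistent with the remark that at least half of the nonzeros must lie in single-entry rows.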

5.1.2     Transformations

Given an augmented grammar G = (N, T, P, S) and its LALR(k) parser with
varying-length lookahead strings and LR(0) reduce states removed, an entry in the ACTION matrix
of such a parser can have one of six values. They are:

•  shift  p 

•  reduce A → ω

•  shift-reduce A → ω

•  la-shift  p' 

•  accept 

•  error 

where p is an LR(0) state of the parser, A → ω ∈ P and p' is a lookahead state.
The  error  parsing  action  is  a  constant  value  that  is  associated  with  each  element  of  the 
ACTION  matrix  whose  indexes  correspond  to  a  state-terminal  pair  on  which  no  action  is 
defined  in  the  automaton.  The  parsing  action  accept  is  also  a  constant  value.  When  an 
LR  parser  is  constructed  from  an  augmented  grammar,  accept  is  associated  with  a  single 
entry  in  the  matrix.  During  parsing,  accept  signals  the  successful  completion  of  a  parse. 

An entry in the GOTO table can have one of three values. They are:

•  goto p

•  goto-reduce A → ω

•  don't care

where p and A → ω are defined as before. A non-useful entry in the GOTO table is
referred to as a don't care entry instead of as an error entry because such an entry is never
consulted during parsing.

5.1.2.1      GOTO  Default  Actions 

Since "don't care" entries in the GOTO table are never accessed, a significant decrease in
the number of useful entries in that table can be obtained by introducing default actions
on nonterminals [33]. In this method, this is achieved as follows. Let GOTO_DEFAULT
be a vector whose elements are indexed by the nonterminals. For each nonterminal A, scan
the column of GOTO indexed by A and find the most frequent action, act, in that column.
Remove all occurrences of act found in the A column and set GOTO_DEFAULT(A) = act.
Looking at this transformation from the point of view of the directed graph representation,
this is equivalent to removing all incoming edges labeled A that point to the state associated
with the action act. (If act is a shift or lookahead-shift then all the incoming edges of its
corresponding state are labeled A.) When actions are removed from the GOTO table,
these entries and all the "don't care" entries are replaced by error entries. During parsing,
if a state-nonterminal pair (p, A) yields error, GOTO_DEFAULT(A) is used.
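
The transformation can be sketched as follows, assuming the GOTO table is held as a dictionary keyed by (state, nonterminal) pairs, so that a missing key plays the role of an error entry. The function names and the dictionary model are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Tuple

Goto = Dict[Tuple[int, str], Hashable]   # (state, nonterminal) -> action


def extract_goto_defaults(goto: Goto) -> Tuple[Goto, Dict[str, Hashable]]:
    """Remove the most frequent action of each column and record it as default."""
    columns: Dict[str, Counter] = {}
    for (_state, nt), act in goto.items():
        columns.setdefault(nt, Counter())[act] += 1
    defaults = {nt: counts.most_common(1)[0][0] for nt, counts in columns.items()}
    reduced = {key: act for key, act in goto.items() if act != defaults[key[1]]}
    return reduced, defaults


def goto_action(reduced: Goto, defaults: Dict[str, Hashable],
                state: int, nt: str) -> Hashable:
    """A missing entry (error) falls back on GOTO_DEFAULT for the nonterminal."""
    return reduced.get((state, nt), defaults[nt])
```

Because a parser never consults a "don't care" entry, returning the default for every missing pair is safe: the only pairs actually looked up are those that were defined in the original table.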

5.1.2.2      Merging of Compatible States and Default Reductions

In chapter 3, the concept of default reduce actions was introduced as a way to optimize
space in the ACTION matrix. In this method, the "default reduce actions" optimization is
combined with the merging of compatible states. Together, these two optimizations result
in a significant decrease in the space requirement of the ACTION matrix.

Definition  5.1.1  A  state  p  is  said  to  be  compatible  with  another  state  q  if  and  only  if 
the  following  conditions  are  satisfied: 

•  p has the same terminal transitions as q; i.e., the rows of the ACTION matrix
associated with p and q have the same shift, shift-reduce and lookahead-shift actions
defined in their corresponding columns.

•  The set of rules on which reduce actions are defined in q is the same as the set of
rules on which reduce actions are defined in p.

•  For each terminal a on which a reduce action is defined in both p and q, the reduce
action is the same.

Two compatible states p and q can be merged into a single state r where the terminal
transitions of r are the same as the terminal transitions of p (or q) and the reduce actions
of r are the union of the reduce actions in p and the reduce actions in q. During parsing,
if state r is entered in a context where p should have been entered and a reduce action
that originated solely from q is executed, that action has the same effect as a default
reduction. Similarly, if r is entered instead of q and a reduce action of p is executed in
r, it can be viewed as a default reduction. Thus, unlike the method of [30], when states
are merged based on the above criteria, the resulting merged states retain their error
detecting capability and no external boolean check matrix is needed.

The set of states in a parser can be partitioned into a set of compatible classes based
on the concept of compatible states described above. In other words, a compatible class of
states is a subset of the set of states where any two states in the class are compatible. In
general, finding an optimal partition of compatible classes which minimizes the number of
subsets in the partition is NP-complete. However, the practical goal one tries to achieve in
merging compatible states is to take advantage of the fact that most languages, especially
programming languages, contain some phrases that are used in different contexts, but the
actions induced by these phrases in the different contexts are compatible. For example,
most procedural languages support arithmetic and relational expressions which are derived
from a generic nonterminal expression and used in different statements such as:

    statement → variable := expression
              | while expression loop ...
              | until expression loop ...
              | for expression do ...
              | if expression then ...

Given a set of states of an LR parser whose kernel items are all of the form [α • Aβ],
where α and β are arbitrary strings, any two such states satisfy conditions 1 and 2 of the
compatibility test. In the above example, the nonterminal expression appears in phrases
that are all produced from the same source, statement, and no two of these phrases share
the same prefix in that context. The LR parser for a grammar containing such rules will
contain a state for each rule whose sole kernel item is the item derived from the right-hand
side of the rule in question with the "dot" in front of expression. These states form a
compatible class.

Experiments have shown that, in most cases, a set of states that meets conditions
1 and 2 of the compatibility test above is made up of states with kernel items having
identical dot symbols. This is an important observation because it is condition 3 that
renders the problem of finding the smallest partition of compatible classes NP-complete
(see Appendix B for proof). To take advantage of this, a practical algorithm for merging
states can be implemented in three steps based on the three conditions of the compatibility
test, as follows:

step 1:  Construct a coarse partition based only on condition 1; i.e., states are grouped
together if their terminal transitions are identical.

step 2:  Each partition class obtained from step 1 is broken down into smaller classes by
applying condition 2.

step 3:  Finally, each class C obtained from step 2 whose states contain reduce actions by
more than one rule is further fragmented by repeating the following heuristic process
until C is empty:

Remove an arbitrary state from C and use it to initialize a new singleton
class C'. Next, each state p in C is considered, in turn. If p is compatible
with all the states in C', it is removed from C and added to C'.
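
The three steps can be sketched as follows. The sketch assumes each ACTION row is a dictionary from terminal to an action tuple such as `("shift", 5)` or `("reduce", 3)`; this encoding and all function names are assumptions chosen for illustration, and "arbitrary state" is taken to be the first remaining one.

```python
from typing import Dict, FrozenSet, List, Tuple

Action = Tuple[str, int]      # e.g. ("shift", 5) or ("reduce", 3)
Row = Dict[str, Action]       # terminal -> action for one state


def transitions(row: Row) -> Row:
    """Shift, shift-reduce and lookahead-shift entries (condition 1)."""
    return {a: act for a, act in row.items() if act[0] != "reduce"}


def reduce_rules(row: Row) -> FrozenSet[int]:
    """Set of rules on which reduce actions are defined (condition 2)."""
    return frozenset(act[1] for act in row.values() if act[0] == "reduce")


def agree_on_reductions(p: Row, q: Row) -> bool:
    """Condition 3: terminals with a reduce in both rows carry the same reduce."""
    return all(p[a] == q[a] for a in p.keys() & q.keys()
               if p[a][0] == "reduce" and q[a][0] == "reduce")


def partition(rows: Dict[int, Row]) -> List[List[int]]:
    # Steps 1 and 2: group states by identical transitions and rule sets.
    coarse: Dict[tuple, List[int]] = {}
    for state, row in rows.items():
        key = (tuple(sorted(transitions(row).items())), reduce_rules(row))
        coarse.setdefault(key, []).append(state)
    classes: List[List[int]] = []
    for (_trans, rules), members in coarse.items():
        if len(rules) <= 1:
            classes.append(members)   # a single rule cannot cause a conflict
            continue
        # Step 3: greedy refinement of classes with more than one reduce rule.
        remaining = list(members)
        while remaining:
            new_class = [remaining.pop(0)]
            for p in list(remaining):
                if all(agree_on_reductions(rows[p], rows[q]) for q in new_class):
                    remaining.remove(p)
                    new_class.append(p)
            classes.append(new_class)
    return classes
```

The greedy step does not promise a minimal partition, which matches the heuristic nature of step 3; it only guarantees that any two states placed in the same class pass all three conditions.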

Once the states of an LR parser have been partitioned into compatible classes, each
class of states is merged into a single merged state. The new merged states obtained by
this process are used to form a new ACTION matrix. An extra column indexed by a
special default symbol is appended to the front of ACTION. A default reduce action is
computed for each merged row and that value is entered in the default column position
of the row in question. The default reduce action of a given row, as described before, is
the most frequently occurring reduce action in that row if there is any; otherwise it is the
error action.
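
Merging a class and computing its default reduction can be sketched as follows, using the same assumed row encoding (terminal to action tuple). The names `merge_rows`, `action`, `ERROR` and the `%default` column label are hypothetical, and dropping entries equal to the default is an assumption about how the default column is exploited for space.

```python
from collections import Counter
from typing import Dict, List, Tuple

Action = Tuple[str, int]
Row = Dict[str, Action]
ERROR: Action = ("error", 0)
DEFAULT = "%default"   # hypothetical label for the special default column


def merge_rows(rows: List[Row]) -> Row:
    """Merge the rows of one compatible class and fill in the default column."""
    merged: Row = {}
    for row in rows:
        merged.update(row)   # compatible rows never disagree on a terminal
    reduces = Counter(act for act in merged.values() if act[0] == "reduce")
    default = reduces.most_common(1)[0][0] if reduces else ERROR
    # Entries equal to the default need not be stored explicitly.
    compact = {a: act for a, act in merged.items() if act != default}
    compact[DEFAULT] = default
    return compact


def action(row: Row, terminal: str) -> Action:
    """Look up a terminal; anything not stored yields the default action."""
    return row.get(terminal, row[DEFAULT])
```

Note that a terminal with no entry at all now yields the default reduction rather than an immediate error, so an erroneous token may be detected only after some reductions have been performed.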

Now, since the terminal transition entries in the new ACTION matrix still refer to the
original state numbers, and since the rows of the ACTION matrix were merged independently
of the GOTO table, one must be able to relate each original state of the parser (still
associated with a row of GOTO) with its merged counterpart in ACTION. This can be
achieved by adding a new column to the GOTO table, each element of which points to its
associated merged row in the new ACTION matrix. From now on, ACTION will denote
the new matrix obtained after compatible states have been merged and default reductions
computed.

5.1.3     Compaction  of  the  Parsing  Tables 

Recall that in order to execute a reduce action by a given rule, the length of the
right-hand side and the left-hand side nonterminal of the rule must be known. Therefore, to
accommodate these actions, a vector, RHS_SIZE, indexable by rule numbers is initialized
with the length of the right-hand side of each rule and a vector, LHS, also indexable by
rule numbers is initialized with the left-hand side symbol of each rule.

LR parsers have two important properties which will be used as the basis of some
important optimizations to improve both space utilization and the time performance of
the compacted tables. They are reviewed here:

1.  Since  each  time  a  parsing  action  is  executed  it  determines  the  next  state  that  will  be 
entered,  the  direct  access  property  of  the  matrix  does  not  have  to  be  preserved.  In 
other  words,  since  the  parser  can  only  enter  a  state  p  if  a  read  transition  places  it  in 
p  or  a  read  transition  had  previously  placed  p  in  the  state  stack  and  a  reduce  action 
re-exposed  p  on  top  of  the  stack  by  popping  states  corresponding  to  the  right-hand 
side  of  a  rule,  only  actions  that  are  read  transitions  to  state  p  need  to  be  able  to 
access  p. 

2.  Another  automaton,  isomorphic  to  the  original  automaton,  that  recognizes  the  same 
language  may  be  constructed  by  permuting  and/or  relabeling  the  rows  and  columns 
of  the  parsing  tables,  and  changing  the  read  transition  entries  accordingly.  Viewed 
as  a  graph,  permuting  the  rows  constitutes  a  relabeling  of  the  vertices  (states)  of  the 
graph;  and  likewise,  permuting  the  columns  constitutes  a  relabeling  of  the  edges  of 
the  graph  or  a  renaming  of  the  symbols. 

5.1.3.1      The  GOTO  Matrix 

Usually,  the  GOTO  matrix  contains  many  rows  with  no  useful  entries  in  them.  After 
nonterminal  actions  are  removed  from  GOTO  by  default,  the  usually  sparse  matrix  ends 
up  being  even  sparser.  With  the  addition  of  the  column  that  identifies  the  corresponding 
rows in ACTION as the first column in GOTO, many rows that previously had no action
end  up  with  a  single  action  and  the  resulting  matrix  almost  always  has  harmonic  decay. 
Before  compacting  the  GOTO  matrix,  its  columns  are  sorted  in  decreasing  order  by  the 
number  of  useful  entries  they  contain.  The  GOTO  matrix  is  then  compacted  into  two 
one-dimensional arrays, GOTO_CHECK and GOTO_INFO, with lower bound |P| + 1.

Since  each  element  of  the  identification  column  contains  a  useful  entry,  this  column 
will  be  the  first  column  in  GOTO  after  the  columns  have  been  sorted.  It  follows  that 
each  row  of  GOTO  will  have  a  unique  displacement  in  its  compressed  representation  and 
the  column  indexes  (nonterminal  symbols)  of  GOTO  can  be  used  as  check  symbols  in 
GOTO_CHECK. Furthermore, since the application of default actions to GOTO usually
removes all useful entries in some columns, the sorting places these empty columns in
the upper range of the column indexes, minimizing the index values that may appear in
GOTO_CHECK. Finally, observe that the elements of GOTO_CHECK that correspond to
a state identification entry in GOTO_INFO can be used for any purpose since they are not
associated  with  any  parsing  action.  One  only  has  to  ensure  that  a  value  placed  in  these 
locations  does  not  fall  in  the  range  of  the  column  indexes  of  GOTO. 
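
The packing described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the thesis implementation: rows are given as dictionaries from column index to value, indexes start at 0 rather than at |P| + 1, column 0 plays the role of the identification column, and the names compact_rows and lookup are hypothetical. Rows are placed by first fit, densest rows first.

```python
def compact_rows(rows, num_cols):
    """Pack sparse rows into (check, info, displacement) arrays by first fit.

    rows: list of dicts {column: value}. Every row is assumed to have an
    entry in column 0 (the identification column); that entry occupies the
    slot at the row's displacement, which forces displacements to be unique.
    """
    check, info = [], []
    disp = [None] * len(rows)
    # First-fit decreasing: place the rows with the most entries first.
    order = sorted(range(len(rows)), key=lambda r: -len(rows[r]))
    for r in order:
        d = 0
        # Slide the row right until none of its entries hits a used slot.
        while any(d + c < len(check) and check[d + c] is not None
                  for c in rows[r]):
            d += 1
        # Grow the arrays so that every index d + column is addressable.
        need = d + num_cols - len(check)
        if need > 0:
            check.extend([None] * need)
            info.extend([None] * need)
        for c, v in rows[r].items():
            check[d + c] = c          # the column index is the check symbol
            info[d + c] = v
        disp[r] = d
    return check, info, disp


def lookup(check, info, d, c, default=None):
    """Entry of the row at displacement d in column c, or default."""
    return info[d + c] if check[d + c] == c else default
```

Because the arrays are extended to cover every displacement plus the number of columns, lookup never needs a bounds test, which mirrors the length requirement placed on GOTO_CHECK later in the text.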

5.1.3.2      Encoding of the Parsing Actions

Since the goto actions in GOTO_INFO represent the transitions of a finite state automaton,
each such state entry can be replaced by its corresponding row displacement. This mapping
eliminates the need for a displacement vector. Furthermore, it allows a convenient encoding
of the different kinds of values that may appear in the parsing tables. Let m be the highest
displacement associated with a state. Any integer r such that 1 ≤ r ≤ |P| represents a rule;
any integer q such that |P| < q ≤ m can potentially represent a state; the accept action is
represented by the constant m + 1 and the error action is represented by m + 2; an integer
sr such that error < sr ≤ error + |P| represents a shift-reduce action; finally, an integer
ls such that error + |P| < ls potentially represents a lookahead-shift action.

With  this  encoding,  one  can  easily  identify  what  kind  of  action  a  given  entry  in  a 
compressed  parsing  table  represents.  Reduce  and  goto-reduce  actions  are  represented  by 
the  number  r  associated  with  the  rule  in  question.  Shift  and  goto  actions  are  represented 
by  the  number  q  associated  with  the  state  in  question.  Shift-reduce  actions  are  encoded 
as  described  above  and  the  constant  error  must  be  subtracted  from  these  values  to  obtain 
the rule in question. Similarly, the constant error + |P| must be subtracted from the value
of  a  lookahead-shift  action  to  obtain  the  lookahead  state  in  question. 
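
As an illustration of this encoding, the following Python sketch classifies an encoded table entry. The function name classify and the string tags it returns are hypothetical; only the numeric ranges come from the text, with num_rules standing for |P|.

```python
def classify(value, num_rules, m):
    """Return (kind, operand) for an encoded parsing-table entry.

    num_rules is |P| and m is the highest displacement associated with a
    state; accept = m + 1 and error = m + 2, as in the text.
    """
    if value < 1:
        raise ValueError("encoded actions are positive integers")
    accept, error = m + 1, m + 2
    la_offset = error + num_rules
    if value <= num_rules:
        return ("reduce", value)              # also covers goto-reduce
    if value <= m:
        return ("shift", value)               # operand is a row displacement
    if value == accept:
        return ("accept", None)
    if value == error:
        return ("error", None)
    if value <= la_offset:
        return ("shift-reduce", value - error)
    return ("lookahead-shift", value - la_offset)
```

For instance, with |P| = 10 and m = 50, the value 53 decodes to a shift-reduce by rule 1 and the value 63 to a lookahead-shift into the row at displacement 1.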

After the GOTO table has been compressed, enough information is known to encode
all parsing actions in both matrices except the lookahead-shift actions, since these actions
are transitions to lookahead states that are associated with rows in ACTION that have no
counterpart in GOTO. The initialization of the state identification entries in GOTO_INFO
must also be deferred until after ACTION has been compacted. Thus, a clean-up phase
is required after the compaction of ACTION has been performed. This will be discussed
later. 

5.1.3.3  The  Action  Matrix 

A typical LR parser usually contains many states (but fewer than half the total number of
states) on which only a single terminal action is defined. The removal of many reduce
actions by default further increases the number of states with a single action as well as the
sparseness of the ACTION matrix. If an optimized ACTION matrix is considered without
its default reduction column, it usually has harmonic decay. Therefore, a good theoretical
approach for compacting ACTION is to remove the default reduction column from it, store
the default reduce action associated with each state in the unused entry in GOTO_CHECK
associated with that state, and compress ACTION without the default reduction column.
Note that when this approach is used, the rows of ACTION will not automatically have
unique displacements. From a practical point of view, it may be beneficial to enforce this
restriction because a typical grammar contains fewer than 255 terminals but a parser for
such a grammar usually contains several hundred states. Therefore, if the column indexes
of ACTION can be used as check values they can be accommodated in a single 8-bit byte.
On the other hand, the number of states may not only exceed 255 but, moreover, when
states are replaced by their displacements, this further increases their range.

A  second  approach  is  to  assume  that  each  entry  in  the  default  reduction  column  is  useful 
and  sort  the  columns  of  ACTION  in  decreasing  order  by  the  number  of  useful  entries  in 
them. As in the case of the GOTO matrix, this guarantees that a column containing only
useful entries will be the first column in ACTION and each row of ACTION will have a
unique displacement in the compressed representation. ACTION is compressed into two
one-dimensional arrays, ACTION_CHECK and ACTION_INFO, with lower bound 1.

If the input grammar used to generate an LR parser does not contain useless nonterminals
(nonterminals that do not produce any terminal string), each state of the parser will
have at least one terminal action. In that case, when the entries of the default reduction
column are considered to be useful entries, each row of ACTION has at least two useful
entries and thus, ACTION does not have harmonic decay. However, experiments have
shown that when the columns of ACTION are sorted in decreasing order prior to applying
the first-fit decreasing method, this rearrangement tends to produce good compaction.
This  second  approach  for  compacting  ACTION  is  the  one  that  is  used  in  this  method. 
For  practical  reasons  which  will  become  clearer  later,  it  turns  out  that  in  most  cases,  this 
approach  saves  space  and,  in  any  case,  it  produces  tables  that  give  a  slightly  faster  time 
performance. 

5.1.3.4      Compressed  Tables  Update 

In  this  section,  the  final  clean-up  phase  of  the  compaction  procedure  is  described.  Let 
d be the displacement (in ACTION_INFO) of a row associated with a lookahead state.
Each entry in ACTION_INFO that is a transition into that lookahead state is replaced
by the value error + |P| + d. Thus, when a lookahead-shift action is decoded, it is the
displacement of the relevant lookahead state in ACTION_INFO that is obtained. Next,
the displacement element in GOTO_INFO associated with each state is updated with the
displacement of its corresponding ACTION row in ACTION_INFO. The encoding of all
parsing  actions  in  the  compressed  tables  is  now  complete. 

In  order  to  avoid  having  to  check  for  boundary  conditions  during  parsing,  one  must 
be  able  to  check  whether  or  not  a  useful  entry  is  defined  for  any  state-symbol  pair.  This 
implies that the GOTO_CHECK vector must contain at least m + |N| elements, where
m is the highest displacement associated with a state. Similarly, the ACTION_CHECK
vector must be extended to contain m' + |T| elements, where m' is the highest displacement
associated with an ACTION row.

Finally, since each element of GOTO_CHECK whose location corresponds to the location
of a state identification entry in GOTO_INFO is still unused, the original row index
(negated to differentiate it from check values) associated with the state in question can
be stored at that location in GOTO_CHECK. This information is not needed for parsing
(even though it might be useful for debugging purposes), but it provides a convenient
remapping of each displacement into its original state number which is useful for storing
the error recovery maps t_symbols and nt_symbols. The domain of these two maps is the
set of states of the parser. This remapping allows the states to be represented by the row
indexes of GOTO instead of the displacements in GOTO_INFO which fall in the larger
range |P| + 1 .. accept - 1.

Note that when this remapping of the states is used with the first approach for compacting
ACTION discussed in the previous section, an additional DEFAULT vector, indexable
by the states, must be used to store the default reduce action associated with each state.
This arrangement not only requires extra space for the new vector but it also requires

Figure 5.3: Compressed parsing tables


-- Assume NUM_RULES = |P| and LA_OFFSET = error + NUM_RULES.

 1  function ACTION(q, t_1..t_k);
 2      q' := GOTO_INFO(q);
 3      if ACTION_CHECK(q' + t_1) = t_1 then
 4          act := ACTION_INFO(q' + t_1);
 5      else act := ACTION_INFO(q');
 6      end if;
 7      i := 1;
 8      while (act > LA_OFFSET) loop
 9          i := i + 1;
10          q' := act - LA_OFFSET;
11          if ACTION_CHECK(q' + t_i) = t_i then
12              act := ACTION_INFO(q' + t_i);
13          else act := ACTION_INFO(q');
14          end if;
15      end loop;
16      return act;
17  end ACTION;

 1  function GOTO(q, A);
 2      if GOTO_CHECK(q + A) = A then
 3          return GOTO_INFO(q + A);
 4      else return GOTO_DEFAULT(A);
 5      end if;
 6  end GOTO;

Figure  5.4:  Parsing  functions 

an additional indexing operation to compute a default reduce action. Experiments have
shown that the number of useful entries in an optimized ACTION matrix is usually so
small that the extra space lost by not obtaining a perfect compaction using the second
approach is exceeded by the space required to store the DEFAULT vector. The computation
of a terminal action with the second approach always requires the same number of
operations.

5.1.3.5      Computing  Parsing  Actions  on  the  Compacted  Tables 

Figure 5.3 shows a graphical representation of the compacted tables. (q* denotes the
row index of GOTO with displacement q.) Given such a representation, the ACTION
and GOTO functions of Figure 5.4 compute the action associated with a state-terminal
or state-nonterminal pair, respectively. Note that if the tables were constructed from an
LALR(1) parser, the code segment in lines 7 through 15 of the ACTION function can be
omitted. The value returned by the ACTION function is either a reduce action if it is
less than or equal to NUM_RULES, a shift action if it is greater than NUM_RULES but
less than accept, or a shift-reduce action if it is greater than error. In the case of a
shift-reduce action, error is subtracted from the value returned by ACTION to obtain the rule
in question. The value returned by the GOTO function is either a goto-reduce action if it
is less than or equal to NUM_RULES or a goto action, otherwise.
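
For readers who prefer executable code, the two lookup functions can be rendered in Python roughly as below. This is a sketch under assumptions: the tables are passed explicitly as 0-based Python sequences, terminal and nonterminal codes are small positive integers, and the lower-case function and parameter names are hypothetical stand-ins for the arrays of the compressed tables.

```python
def action(q, lookahead, goto_info, action_check, action_info, la_offset):
    """Terminal action for the state whose GOTO displacement is q.

    lookahead is the sequence t_1..t_k of terminal codes. Column 0 of each
    ACTION row holds the default action, so action_info[row] is the default.
    """
    row = goto_info[q]                      # displacement of the ACTION row
    i = 0
    t = lookahead[i]
    act = action_info[row + t] if action_check[row + t] == t else action_info[row]
    while act > la_offset:                  # lookahead-shift: read one more symbol
        i += 1
        t = lookahead[i]
        row = act - la_offset               # displacement of the lookahead row
        act = action_info[row + t] if action_check[row + t] == t else action_info[row]
    return act


def goto(q, a, goto_check, goto_info, goto_default):
    """Nonterminal action for displacement q and nonterminal code a."""
    return goto_info[q + a] if goto_check[q + a] == a else goto_default[a]
```

The loop in action is only entered for tables built from a grammar that needs more than one symbol of lookahead; for LALR(1) tables the first lookup already yields the final action.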

5.2      Error  Recovery  Tables 

The  error  recovery  method  proposed  in  the  previous  chapter  requires  that  some  information 
be  precomputed  from  the  input  grammar  and  its  automaton.  This  information  includes 
the scopes, discussed in section 4.3.3.1, the map t_symbols from each state q into a relevant
subset of viable terminal error recovery candidates for q, the map nt_symbols from each
state q into a relevant subset of viable nonterminal error recovery candidates for q and the
names associated with each viable error recovery candidate.

5.2.1     Optimization  of  Candidates 

Consider the case of a secondary substitution in which a recovery goal A must be inserted
into the input stream. In such a case, every nonterminal candidate in the recovery state
q_i in question is a potential reduction goal. However, an implementation that checks all
potential  candidates  for  each  error  phrase  would  be  prohibitively  slow. 

Two  optimizations  are  applied  to  the  set  of  nonterminal  candidates  in  a  given  state  to 
obtain,  in  most  cases,  a  substantially  reduced  subset  of  relevant  reduction  goals. 

In [27], the following concept is presented: a reduction goal A of error phrase β|x in
error configuration αβ|xy is important if β|x has no reduction goal B such that B ⇒+ A.
In  this  method,  a  more  restricted  concept  of  an  important  symbol  is  used.  The  new  concept 
takes  into  consideration  the  full  context  of  the  error  phrase: 

Definition 5.2.1 A nonterminal A on which a transition is defined in a state q_i is said to
be important if A does not appear in a single item of the form B → ·A in q_i.

    E → ·E + T        E → ·T
    T → ·T * F        T → ·F
    F → ·F ↑ P        F → ·P
    P → ·id           P → (·E)


Figure 5.5: Items in a state q_i


For example, assume a recovery state q_i contains the set of items shown in Figure 5.5.
By the definition of [27], the only important reduction goal in such a state is E, since T,
F  and  P  can  be  derived  from  E  via  a  chain  of  unit  productions.  By  the  more  restricted 
definition  of  this  method,  T  and  F  would  also  be  considered  important  symbols  since  they 
appear  immediately  to  the  right  of  the  dot  in  more  than  one  item.  To  understand  the 
importance  of  T  and  F,  assume  that  the  rules  from  which  the  items  of  Figure  5.5  are 
derived  are  all  the  productions  of  a  grammar  and  consider  the  following  erroneous  input 
strings: 

( ) ) ( * id + id
( ) ) ( ↑ id + id

If E is the only important symbol considered, then the best secondary repair that is
achievable is the replacement of "( ) ) ( * id" by E in the first sentence and
"( ) ) ( ↑ id" by E in the second sentence. However, it is clear from the grammar that
replacing "( ) ) (" by T in the first sentence and by F in the second sentence would be preferable.

One further notices that using F as a reduction goal in the first sentence would have
worked just as well, since after a transition on F, with the symbol "*" as lookahead, a
reduction by the rule "T → F" would be applied. Similarly, P could have been used as
a  suitable  reduction  goal  in  both  sentences.  This  leads  to  the  following  concept,  on  which 
the  second  optimization  is  based: 

Definition 5.2.2 A nonterminal element C of a set of nonterminal candidates S in an
LR state q is said to be relevant with respect to S if there does not exist a nonterminal D
such that D ∈ S, D ≠ C, and D can be successfully substituted for C as a reduction goal
for any error phrase with q as the recovery state.

Given a set S of nonterminal candidates for a given state, the objective is to find the
largest subset S' ⊆ S such that S' contains only relevant reduction goals.

Lemma 5.2.1 Let S = {B_1, ..., B_k}; for 1 ≤ i ≤ k, B_i ∈ S is relevant iff ∄ B_j, j ≠ i,
such that B_i ⇒+ B_j.

The  proof  follows  directly  from  the  definition  of  an  LR  parser.  If  a  nonterminal  B  can 
be substituted for an error phrase, then the recovery symbol t in question must be a valid
lookahead symbol for any rule derivable from B. In particular, if B_i ⇒+ B_j and B_j is
substituted for an error phrase where B_i is known to be a valid reduction goal, the recovery
symbol will cause B_j to be reduced to B_i.

For each state q in an LR automaton, the set nt_symbols(q) is obtained as follows.
Starting with the set of nonterminal symbols on which an action is defined in q, remove all
unimportant symbols from that set, and reduce the resulting set further by removing all
irrelevant reduction goals from it. For example, consider the state q_i of Figure 5.5. State
q_i contains nonterminal transitions on the symbols E, T, F and P. The only unimportant
symbol in that set is P. After P is removed, the irrelevant symbols E and T are removed
from the subset {E, T, F} leaving F as the only relevant reduction goal in q_i.
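
The second filtering step can be illustrated with a small Python sketch of the test in Lemma 5.2.1. It assumes, for simplicity, that one candidate derives another only through chains of unit productions (nullable symbols are ignored); the function name relevant_goals and the pair representation of rules are hypothetical.

```python
def relevant_goals(candidates, unit_rules):
    """Keep a candidate B unless B derives another candidate through a
    chain of one or more unit productions (the test of Lemma 5.2.1).

    unit_rules: iterable of (lhs, rhs) pairs, one per production whose
    right-hand side is a single nonterminal.
    """
    succ = {}
    for lhs, rhs in unit_rules:
        succ.setdefault(lhs, set()).add(rhs)

    def reachable(start):
        # All nonterminals derivable from start in one or more unit steps.
        seen, stack = set(), list(succ.get(start, ()))
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(succ.get(x, ()))
        return seen

    pool = set(candidates)
    return {b for b in pool if not ((reachable(b) - {b}) & pool)}
```

With the unit productions E → T, T → F and F → P of the example grammar, the candidate set {E, T, F} reduces to {F}, in agreement with the text.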

The  notion  of  an  important  symbol  can  also  be  extended  to  terminal  candidates  in 
the t_symbols sets. Once again, consider the state q_i of Figure 5.5. This state contains a
single terminal action on the symbol id, but, since id appears only in the item P → ·id,
it is not an important candidate in q_i. The removal of unimportant terminals improves
the  time  performance  of  the  primary  recovery  and  saves  space.  However,  it  may  suppress 
some  opportunities  for  merging  and  misspelling  corrections. 

In  section  5.2.3,  an  algorithm  is  presented  that  can  be  used  to  further  reduce  the  space 
used by t_symbols and nt_symbols.

5.2.2  Optimization  of  Names 

In  order  for  the  error  recovery  to  issue  meaningful  diagnoses,  the  names  of  the  symbols 
used  in  the  grammar  should  be  saved.  In  fact,  it  is  very  helpful  to  allow  the  user  to  map 
terse  symbol  names  used  in  his  grammar  into  more  descriptive  names.  For  example,  a 
nonterminal  "relop"  might  be  mapped  into  the  string  "relational  operator";  the  end-of-file 
terminal  in  an  Ada  grammar  might  be  mapped  into  the  string  "End  of  Ada  source",  etc. 
Depending  on  the  nature  of  an  input  grammar,  some  nonterminals  may  never  be  used 
in  issuing  a  diagnosis.  Therefore,  the  names  associated  with  these  symbols  can  be  removed. 
Two  kinds  of  nonterminals  fall  in  this  category:  nullable  nonterminals  and  nonterminals 
that are always subsumed by another nonterminal. The processing of nullable nonterminals
in issuing diagnostic messages was discussed in section 4.4.3. A nonterminal B is said to
be always subsumed by another nonterminal if whenever B appears in the right-hand
side of a rule, the rule in question is a single-production of the form A → B, for some
nonterminal A. In such a case, B (actually, the name associated with B) will never appear
in  a  diagnostic  message  since  the  left-hand  side  symbol  A  in  question  would  be  a  higher 
symbol  (see  section  4.4.3)  than  B. 
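
A minimal Python sketch of the "always subsumed" test follows. It assumes rules are given as (left-hand side, right-hand side tuple) pairs and, as a simplification, reports only nonterminals that occur in at least one right-hand side; the function name always_subsumed is hypothetical.

```python
def always_subsumed(rules, nonterminals):
    """Nonterminals whose every right-hand-side occurrence is in a
    single production of the form A -> B.

    rules: list of (lhs, rhs) pairs with rhs a tuple of symbols.
    """
    appears, disqualified = set(), set()
    for _lhs, rhs in rules:
        for sym in rhs:
            if sym in nonterminals:
                appears.add(sym)
                if len(rhs) != 1:
                    # Occurs next to other symbols, so its own name may
                    # still be needed in a diagnostic message.
                    disqualified.add(sym)
    return appears - disqualified
```

The names of the nonterminals returned by this function are the ones that could be dropped from the name table under the rule stated above.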

5.2.3  Optimization  of  Finite  Subsets  of  a  Small  Universe 

Observe that most of the information generated for error recovery consists of finite subsets
of a small universe. If the only operation performed on these subsets is iteration, a
sequential representation of these sets will yield a good time performance. For example,
t_symbols and nt_symbols are mappings from states into subsets of terminals and nonterminals,
respectively. During error recovery, the only operation performed on these sets of
candidates is iteration.

In  practice,  many  of  the  subsets  involved  in  these  applications  are  identical  and  can 
share  the  same  space.  In  addition,  the  resulting  collection  of  unique  subsets  can  be  further 
optimized  with  a  new  algorithm,  called  Optimal  Partition  [40],  that  takes  advantage  of  the 
fact that some subsets in that collection are proper subsets of others. Optimal Partition is
used to optimize the subsets in the range of t_symbols and nt_symbols as well as the sets of
states associated with the scopes. The operation performed on the set of states associated
with a scope is a lookup. However, such a set is usually so small that, in practice, a lookup
on a sequential representation of it is acceptable.

5.2.3.1  The  General  Problem 

Let Σ be an alphabet and let C be a collection of unique subsets of Σ:

C = {Σ_1, Σ_2, ..., Σ_n}.

Find a sequential representation of C that minimizes the storage space required such that
the elements of each Σ_i, 1 ≤ i ≤ n, occur in consecutive locations.

A  set  C  is  said  to  have  the  consecutive  retrieval  property  with  respect  to  a  universe 
Σ if there exists an organization of Σ (without duplication of any symbol) such that for
every Σ_i ∈ C all relevant symbols of Σ_i can be stored in consecutive storage locations. A
polynomial time algorithm for finding such an arrangement, if it exists, was presented by
Fulkerson and Gross [2]. However, in most practical cases, a pair (Σ, C) does not have
the  consecutive  retrieval  property.  Hence,  duplication  of  symbols  must  be  allowed  so  that 
pertinent  symbols  corresponding  to  any  subset  in  C  are  always  stored  consecutively.  This 
problem  can  be  stated  more  formally  as  follows: 

Is there a string w ∈ Σ* with |w| ≤ K such that for each i, the elements of Σ_i
occur in a consecutive block of |Σ_i| symbols of w?

This problem, known as Consecutive Sets (CS), was posed by Kou in [1] and proven to
be NP-complete. Now, consider the problem of finding a string w, of minimum length,
required to store the subsets of C such that the elements of each subset occur in a consecutive
block of symbols of w and if any two subsets Σ_i and Σ_j share common elements then
one is a proper subset of the other. This problem can be stated more formally as follows:

Is there a partition P of C such that for a given integer K:

• ∀ p_k ∈ P, ∀ (x, y) ∈ p_k: either x ⊆ y or y ⊆ x;

• Let Σ_k represent the largest subset in p_k ∈ P; then ∑_{p_k ∈ P} |Σ_k| ≤ K.

This problem is the Optimal Partition problem, OP for short. In Appendix C, OP is
proved to be solvable in polynomial time by reducing it to the maximum weighted bipartite
matching problem [40]. In the remainder of this section, an algorithm based on OP is used to
optimize the space required to store a set of sets C sequentially, followed by the description
of  a  time-efficient  heuristic  for  OP  which,  in  most  cases,  finds  a  partition  that  is  very  close 
to  being  optimal. 

5.2.3.2  Application  of  OP 

With OP, one can take advantage of the fact that there is no overlap of subsets in the
elements of the partition P to construct a string w such that each subset can be identified
with only one index indicating its starting point. Assuming # is a special marker symbol,
# ∉ Σ, one proceeds as follows:

step 0: initialize w to the empty string.
step 1: for each p_k ∈ P do steps 1.1-1.3

    step 1.1: construct [Σ_{k_1}, Σ_{k_2}, ..., Σ_{k_m}, Σ_{k_{m+1}}], an ordered list of the elements
              of p_k, where |p_k| = m, Σ_k = Σ_{k_1}, Σ_{k_{m+1}} = ∅, and Σ_{k_i} ⊃ Σ_{k_{i+1}},
              1 ≤ i ≤ m. It is known that such a linear ordering of the elements of
              p_k is possible by the definition of OP.
    step 1.2: for i in 1..m, append to w all symbols that are in Σ_{k_i} but not in
              Σ_{k_{i+1}}.
    step 1.3: append to w the symbol #.
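
The construction of w can be sketched in Python as follows. The representation is an assumption made for illustration: each class of the partition is a list of Python sets ordered from largest to smallest, w is a list of symbols, and the helper names build_string and members are hypothetical.

```python
MARK = "#"   # marker symbol, assumed not to occur in the alphabet


def build_string(partition):
    """Lay out every subset of the partition in one sequence w.

    partition: list of chains; each chain is a list of sets in which every
    set strictly contains the next one. Returns (w, start), where start
    maps each subset (as a frozenset) to its starting index in w.
    """
    w, start = [], {}
    for chain in partition:
        for i, s in enumerate(chain):
            nxt = chain[i + 1] if i + 1 < len(chain) else set()
            start[frozenset(s)] = len(w)
            w.extend(sorted(s - nxt))     # symbols in this set but not the next
        w.append(MARK)
    return w, start


def members(w, index):
    """Iterate over the subset that starts at index, up to the marker."""
    out = []
    while w[index] != MARK:
        out.append(w[index])
        index += 1
    return out
```

Each subset is thus a suffix of its class's block, so a single starting index is enough to enumerate it.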

5.2.3.3      OP  Heuristic 

Even  though  OP  can  be  computed  by  a  polynomial-time  algorithm,  the  algorithm  in 
question  is  not  very  efficient  and  in  certain  environments  may  prove  to  be  impractical.  A 
simple  heuristic  for  OP  will  be  described  below.  In  most  cases,  this  approach  works  very 
well and can be computed in O(n^2) time.

A set Σ_k that is not included in any other set in a given partition is called a base
set. The following heuristic is essentially a greedy algorithm that constructs the classes of
the partition one at a time and tries to minimize the length of the final string by always
including the largest subset it can find. The partition P is constructed as a set of ordered
lists of subsets, where each such list makes up a class of the partition and the order of the
elements in the list is exactly the proper linear ordering for the subsets:

step 0: P = ∅;

step 1: Construct a partition P' = {p_i ≠ ∅ : i ∈ [1..|Σ|]} of C, where:
        p_i = {Σ_j : Σ_j ∈ C and |Σ_j| = i}

step 2: do steps 2.1 and 2.2

    step 2.1: Let p_k be the class in P' with the largest subsets. Let x be an arbitrary
              subset of p_k; remove x from p_k and initialize list := [x];
    step 2.2: if p_k = ∅ then remove p_k from P'; end if

step 3: for i in [k - 1, k - 2, ..., 1] | p_i ∈ P' loop
            if ∃ y ∈ p_i | y ⊂ x then
                remove y from p_i;
                list := list + [y];
                x := y;
                if p_i = ∅ then remove p_i from P'; end if;
            end if;
        end loop;

step 4: P = P ∪ {list};

step 5: if P' ≠ ∅ then goto step 2; else stop; end if;
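
The greedy heuristic can be sketched in Python as below. This is an illustrative rendering, not the thesis code: the size classes are kept in a dictionary rather than built with a bucket sort, the input is assumed to be a collection of distinct non-empty sets, and the name op_heuristic is hypothetical.

```python
def op_heuristic(collection):
    """Greedy approximation of Optimal Partition.

    collection: iterable of distinct non-empty sets. Returns a list of
    chains; within a chain every set strictly contains the next one.
    """
    by_size = {}                                   # P': subsets grouped by size
    for s in map(frozenset, collection):
        by_size.setdefault(len(s), []).append(s)
    partition = []
    while by_size:
        k = max(by_size)                           # class with the largest subsets
        x = by_size[k].pop()
        if not by_size[k]:
            del by_size[k]
        chain = [x]
        # Visit the smaller size classes in decreasing order of size.
        for i in sorted((n for n in by_size if n < k), reverse=True):
            y = next((c for c in by_size[i] if c < x), None)   # proper subset of x
            if y is not None:
                by_size[i].remove(y)
                if not by_size[i]:
                    del by_size[i]
                chain.append(y)
                x = y
        partition.append(chain)
    return partition
```

Each chain returned here has the linear ordering expected by the string construction of the previous section, with the base set first.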

Using a bucket sort, step 1 can be computed in O(|Σ| + n) time. For steps 2-5, the
worst case occurs when no subset is a superset of another and each p_i contains exactly one
element. In that case, each subset is removed, in turn, from p_i ∈ P' and tested against the
element in each p_j, 1 ≤ j < i. The total number of subset checks in that case is n(n - 1)/2,
which gives us a total running time of O(n^2). Experimental results using both OP and
this heuristic algorithm are presented in Figure C.1 in section C.3.

5.3     Remarks 

When parsing tables are compressed with this method, each parsing action can be computed
with a fixed number of primitive operations on integers and arrays of integers. For an
LALR(1) parser, the computation of a terminal action costs exactly 3 indexing operations,
1 addition and 1 comparison. For an LALR(k) parser, in addition to the initial LALR(1)
action, k - 1 extra lookahead transitions may be required. Each lookahead transition costs
2 additions, 1 subtraction, 2 comparisons and 2 indexing operations. The computation of
a nonterminal action always costs 2 indexing operations, 1 addition and 1 comparison.

Figure D.1 of Appendix D shows the space requirement of the parsing and error recovery
tables  of  6  programming  language  grammars. 

Conclusion


A new method has been presented for constructing practical and efficient LALR(k) parsers
with automatic error recovery. Significant contributions and improvements over the current
state of the art were made in three areas: parser generation, error recovery, and table
optimization and compaction.

In  the  parser  generator  described  in  this  thesis,  only  the  minimum  amount  of  lookahead 
information necessary (up to k levels) to disambiguate inconsistent states in the LR(0) automaton
is computed. Consequently, this parser generator is efficient since, in practice, few
states  require  more  than  one  lookahead.  Furthermore,  the  generated  parsers  are  efficient, 
because  at  run-time  the  parser  consults  extra  lookahead  symbols  only  when  necessary. 

The LR(k) error diagnosis and recovery method presented here is completely language-
and machine-independent and more efficient than other known methods. It features the
following  innovations  and  techniques: 

• a new deferred driver that always detects an error at the earliest possible point;

• a generalized simple recovery method that uses both terminal and nonterminal symbols;

•  a  phrase-level  recovery  that  is  an  efficient  (and  completely  automatic)  generalization 
of  the  error  production  method; 

•  a  new,  completely  automatic  method  for  scope  recovery. 

A number of innovative space optimization ideas are introduced in this thesis. The
number of useful entries in the parsing tables and the error recovery maps is reduced,
and space is shared among similar sets. Following optimization, the parsing tables and
the error recovery maps are compacted separately. This allows complete and accurate
information to be retained. The resulting compacted parsing tables are the most time-
and space-efficient known to date.
Appendix A

Experimental  Results  with  the 
LALR  Algorithms 


In this appendix, experimental results obtained from six grammars of different sizes and
complexity are presented. The table of Figure A.1 summarizes the characteristics of each
of the grammars used. The Pascal1 and Pascal2 grammars are LALR(2) and the others
are LALR(1). Pascal2 is the grammar of Appendix F. Ada is the grammar of [31]. Note
that for the Pascal1 and Pascal2 grammars, two numbers are reported under the "#states"
column. The first is the number of LR(0) states and the second is the number of lookahead
states. The automaton for these grammars also contains a lookahead-shift action for each
lookahead state.


Grammar  |T|  |N|  |P|  #items  #states  #shifts  #shift-reduces  #gotos  #goto-reduces
Pascal    63  110  213     626      193      396             329     336            574
Pascal1   63  111  215     625    189/1      393             324     332            569
Pascal2   63  111  215     627    191/5      393             326     334            569
C         85   98  266     786      206      349             987     975            322
Ada       96  304  523    1527      516      593             700     884           1779
Sedl     123  362  746    2398      824     1987            5766    2250           4360

Figure A.1: Grammar information

The first three columns in the table above indicate the number of terminals, nonterminals
and productions, respectively, in each grammar; the #items column indicates the
total number of items in the grammar, i.e., let $rhs_i$ represent the list of symbols on the
right-hand side of production $i$, then $\#items = \sum_{i=1}^{|P|} (|rhs_i| + 1)$. The #states column
indicates the number of states in the LR(0) automaton; the #shifts and #gotos columns
indicate, respectively, the number of shift and goto transitions in the LR(0) automaton.
Figure D.1 gives more information about these grammars and their automata.
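
The item count described above can be illustrated with a short sketch. The representation of a production as a `(lhs, rhs)` pair and the three-rule toy grammar are assumptions made only for this example; they are not taken from the grammars measured in the thesis.

```python
def count_items(productions):
    """Return the total number of LR(0) items of a grammar.

    Each production contributes |rhs| + 1 items, one for every possible
    position of the dot in its right-hand side.
    `productions` is a list of (lhs, rhs) pairs, rhs being a list of symbols.
    """
    return sum(len(rhs) + 1 for _lhs, rhs in productions)


# Hypothetical toy grammar, used only to exercise the formula.
toy_grammar = [
    ("E", ["E", "+", "T"]),  # 4 items
    ("E", ["T"]),            # 2 items
    ("T", ["id"]),           # 2 items
]

print(count_items(toy_grammar))  # 8
```

An empty right-hand side contributes exactly one item, which is why the "+ 1" appears inside the sum.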

The main goal of the experiments was to compare the time performance of each method
in terms of the number of union operations it required to compute the LALR(1) lookahead
sets for each grammar. Note that lookahead sets were computed only when necessary to
resolve conflicts. In particular, lookahead sets associated with LR(0) reduce states were
not computed. The KM, DP and PCC methods were implemented. It was not necessary
to implement the BL method since its time performance, in terms of the number of union
operations performed, is identical to that of the DP method.

The KM algorithm was implemented as described in section 2.2.1. The number of
union operations required is reported under the #unions column. When the improvement
suggested in section 2.3.1 is added, the number of union operations required is calculated
as the number of visits (#visits) made to TRANS. Recall that in such a case a preliminary
pass is required to compute the DR sets for each state. The total number of elements in
these sets is equal to #shifts. The data from the KM experiments is shown in Figure A.2.


KM      #visits  #unions
Pascal     1176     3382
C         13651    36829
Ada        5861    12038
Sedl      30770    67479

Figure A.2: Results of KM experiments

The table of Figure A.3 shows the number of union operations required to compute the
necessary FIRST sets. The N-unions column indicates the number of union operations
required to construct the FIRST sets for nonterminals. The S-unions column indicates
the additional operations required to compute the FIRST sets for suffixes that follow a
nonterminal in a production of P. The cost of computing these maps was incurred by the
DP and PCC algorithms.
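
To make the notion of "union operations needed for the FIRST sets of nonterminals" concrete, the following is a minimal sketch that computes FIRST sets by naive fixed-point iteration and counts every set union it attempts. This is an assumption-laden illustration: the thesis does not state that its FIRST sets were computed this way, so the counts produced here are not comparable to the N-unions figures. The function name `first_sets` and the toy grammar are hypothetical.

```python
def first_sets(productions, terminals):
    """Compute FIRST sets of nonterminals by fixed-point iteration.

    productions: list of (lhs, rhs) pairs, rhs being a list of symbols.
    terminals:   set of terminal symbols.
    Returns (first, nullable, unions), where `unions` counts every
    attempted set union (an illustrative cost measure only).
    """
    first = {lhs: set() for lhs, _ in productions}
    nullable = set()
    unions = 0
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            for sym in rhs:
                contribution = {sym} if sym in terminals else first[sym]
                unions += 1
                if not contribution <= first[lhs]:
                    first[lhs] |= contribution
                    changed = True
                if sym not in nullable:
                    break  # later symbols cannot start a derived string
            else:
                # every symbol of rhs is nullable (or rhs is empty)
                if lhs not in nullable:
                    nullable.add(lhs)
                    changed = True
    return first, nullable, unions


# Hypothetical toy grammar: E -> T E',  E' -> + T E' | (empty),  T -> id
toy = [("E", ["T", "E'"]), ("E'", ["+", "T", "E'"]), ("E'", []), ("T", ["id"])]
first, nullable, unions = first_sets(toy, {"+", "id"})
print(first, nullable, unions)
```

A worklist or digraph-based formulation would perform fewer unions than this naive loop; the sketch only shows what is being counted.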
Figure A.3: Unions required for FIRST maps

The DP method was implemented with the improvements suggested in section 2.3.2.
The data from the DP experiments is shown in the table of Figure A.4. The #READ
column indicates the number of union operations required to construct the READ sets
from FIRST sets. The #FOLLOW column indicates the number of union operations
required to construct the FOLLOW sets from READ sets. The #LA column indicates the
number of union operations required to construct the LA sets from FOLLOW sets. The
Total column includes the union operations incurred by the construction of the FIRST
sets.
The data from the PCC experiments is shown in the table of Figure A.5. The #PATH
column indicates the number of union operations required to construct the PATH sets.
The #FOLLOW and #LA columns indicate the number of union operations required to
construct the FOLLOW and LA sets, respectively. The Total column includes the union
operations incurred by the construction of the FIRST sets.

Finally, the table of Figure A.6 shows how the different methods performed. Each entry
in the table indicates the total number of union operations that were required by the
particular method that labels the column and the grammar that labels the row.
Figure A.4: Results of DP experiments
Figure A.5: Results of PCC experiments


Grammar 

KM 

DP/Ch 

PCC 

Pascal 

1176 

1393 

2664 

Pascall 

1383 

Pascal2 

1420 

C 

13661 

3168 

6485 

Ada 

6861 

3127 

7533 

Sedl 

30770 

11036 

21484 

Figure A.6: Comparison of the methods
Appendix B

String  Compatibility 


Let $\Sigma$ be an alphabet containing a special blank character, denoted $b$. Consider a set $S$ of
$n$ strings of $\Sigma$ characters, each of length $m$:

$S = \{s_1, s_2, \ldots, s_n\}.$

A string $s_i \in S$ is said to be compatible with another string $s_j \in S$ if and only if for
$1 \le k \le m$, either $s_i(k) = s_j(k)$ or $s_i(k) = b$ or $s_j(k) = b$. This compatibility relation is
reflexive and symmetric, but not transitive.
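
The compatibility test can be sketched in a few lines. The choice of `"_"` as the blank character and the three sample strings are assumptions made for this illustration only; the sample shows why the relation fails to be transitive.

```python
BLANK = "_"  # stand-in for the blank character (an assumption of this sketch)


def compatible(s, t, blank=BLANK):
    """True if s and t agree at every position where neither is blank."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return all(a == c or a == blank or c == blank for a, c in zip(s, t))


s1, s2, s3 = "x_", "__", "y_"
print(compatible(s1, s2))  # True
print(compatible(s2, s3))  # True
print(compatible(s1, s3))  # False: 'x' and 'y' clash at the first position
```

Here `s2` is compatible with both `s1` and `s3`, yet `s1` and `s3` are not compatible with each other.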

The String Compatibility problem is: given some positive integer $K$, can $S$ be partitioned
into $k \le K$ compatible classes? A compatible class of $S$ is a subset of $S$ whose
elements are pairwise compatible. The String Compatibility problem is NP-complete.
This will be shown by reducing clique cover [8] (also known as Partition into Cliques [22])
to String Compatibility [44]. Recall the definition of Partition into Cliques:

Instance: Graph $G = (V, E)$, positive integer $K \le |V|$.

Question: Can the vertices of the graph $G$ be partitioned into $k \le K$ disjoint
sets $V_1, \ldots, V_k$ such that for $1 \le i \le k$, the subgraph induced by $V_i$ is a
complete graph (also called a clique)?

Reduction 

Given an instance of "Partition into Cliques", construct a negative graph $G' = (V, E')$,
where $E' = V \times V - E$. Let each $e' \in E'$ be represented by an integer in the range $1..|E'|$
and let $\Sigma = V \cup \{b\}$. The set $S$ is constructed by mapping each vertex $v_i$ into a string $s_i$
of length $|E'|$ and adding $s_i$ to $S$. The mapping is done as follows. If $e' = (v_i, v_j) \in E'$
then $s_i(e') = v_j$; otherwise $s_i(e') = b$. This transformation can clearly be computed in
polynomial time.
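
The construction can be sketched as follows, under two stated assumptions: the complement edge set is taken over unordered pairs of distinct vertices, and position $e'$ of a string holds the other endpoint of $e'$ when the vertex is incident to it. `None` stands in for the blank character. The four-vertex sample graph is hypothetical.

```python
from itertools import combinations

BLANK = None  # stand-in for the blank character (an assumption of this sketch)


def reduce_to_strings(vertices, edges):
    """Map each vertex to a string indexed by the edges of the complement graph."""
    edge_set = {frozenset(e) for e in edges}
    complement = [frozenset(p) for p in combinations(vertices, 2)
                  if frozenset(p) not in edge_set]
    strings = {}
    for v in vertices:
        row = []
        for e in complement:
            if v in e:
                (other,) = e - {v}   # the other endpoint of the missing edge
                row.append(other)
            else:
                row.append(BLANK)
        strings[v] = tuple(row)
    return strings


def compatible(s, t):
    return all(a == c or a is BLANK or c is BLANK for a, c in zip(s, t))


vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
strings = reduce_to_strings(vertices, edges)
for u, v in combinations(vertices, 2):
    print(u, v, compatible(strings[u], strings[v]))
```

In this sample, two strings come out compatible exactly when their vertices are adjacent in the original graph, so pairwise-compatible sets of strings correspond to cliques.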

Claim: a subset of the strings in $S$ forms a compatible class if and only if the
corresponding vertices in $G$ form a clique.

If. Let $C$ be a subset of $S$ that forms a compatible class; then for each pair of strings
$s_i$ and $s_j$ in $C$, $(v_i, v_j) \notin E'$. Therefore, $(v_i, v_j) \in E$. It follows that the vertices that
correspond to the strings in $C$ induce a clique in $G$.

Only if. Conversely, given a clique in $G$ and an edge $(v_i, v_j)$ in that clique, the
corresponding strings $s_i$ and $s_j$ are compatible. In fact, if $s_i$ is not compatible with $s_j$ then
there exists an edge $e' \in E'$ such that $s_i(e') = v_j$ and $s_j(e') = v_i$. But this implies that
$(v_i, v_j) \in E'$, which is a contradiction.

Thus, $G$ can be partitioned into $k$ cliques if and only if $S$ can be partitioned into $k$
compatible classes. $\Box$
Appendix C

Optimal  Partition 


C.1 Reduction to Maximum Weighted Bipartite Matching Problem

Given an alphabet $\Sigma$ and a collection $C = \{\Sigma_1, \Sigma_2, \ldots, \Sigma_n\}$, where $\Sigma_i \subseteq \Sigma$ for each
$\Sigma_i \in C$, an effective method is presented to construct a bipartite graph $G = (V, V', E)$
with a weight function $W$ defined on $E$ such that a maximum weighted matching obtained
from $G$ defines an Optimal Partition for $C$ with the elements of each class in the partition
linearly ordered.

C.1.1      Bipartite  Graph 

The bipartite graph $G = (V, V', E)$ is constructed from the collection $C$ as follows. Firstly,
each subset $\Sigma_i$ is associated with a unique node $v_i \in V$ and a unique node $v'_i \in V'$. Next,
two bijective functions $f$ and $f'$ are used to make the respective associations:

$f : C \to V$
$f' : C \to V'$
$V = \{f(\Sigma_i) : \Sigma_i \in C\}$
$V' = \{f'(\Sigma_i) : \Sigma_i \in C\}$

The node yielded by $f(\Sigma_i)$ is denoted $v_i$ and the node yielded by $f'(\Sigma_i)$ is denoted $v'_i$.
Two nodes $v_i \in V$ and $v'_i \in V'$ corresponding to the same subset $\Sigma_i$ are said to be twins.

Let the operator $\subset$ ($\supset$) denote proper inclusion, i.e., $A \subset B$ ($A \supset B$) means that $A$ is
a subset (superset) of $B$, but $A \ne B$. A binary relation $R \subseteq C \times C$ is defined as follows:

$R$ is transitive, non-reflexive, and non-symmetric. The set of edges $E \subseteq V \times V'$ in $G$
is constructed as follows:

$E = \{(v_i, v'_j) : v_i \in V, v'_j \in V' \mid (\Sigma_i, \Sigma_j) \in R\}$
Given an edge $(v_i, v'_j) \in E$, $v_j$ is said to be a successor of $v_i$ and $v_i$ is a predecessor of
$v'_j$. Let $M \subseteq E$ be a bipartite matching, not necessarily maximal, on $G$; then Lemmas C.1.1
and C.1.2 follow:

Lemma C.1.1 Each $v_i \in V$ is connected to at most one successor in $M$.

Lemma C.1.2 Each $v'_i \in V'$ is connected to at most one predecessor in $M$.

Proof: definition of a bipartite matching.

A node $v_i \in V$ having no successor is called a terminal node. A node $v_i \in V$ whose
twin has no predecessor is called a base node. Let $\beta$ be the set of base nodes induced by
$M$ and let $\gamma$ be the set of terminal nodes induced by $M$.

Lemma C.1.3 $\beta \ne \emptyset$

Proof: Assume that $\beta = \emptyset$. Then every node in $V$ is matched to some node in $V'$.
This implies that for each subset $\Sigma_j \in C$, $\exists\, \Sigma_i$ and $\Sigma_k$ in $C$ such that $\Sigma_i \subset \Sigma_j \subset \Sigma_k$; but
this implies that the relation $R$ is circular, which is a contradiction.

Corollary C.1.1 $\gamma \ne \emptyset$ and $|\gamma| = |\beta|$

Proof: By the same argument used for Lemma C.1.3 above, $\gamma \ne \emptyset$. By definition of
bipartite matching, it follows that the number of unmatched nodes in $V$ must be equal to
the number of unmatched nodes in $V'$.

Consider an arbitrary matching $M$ and the sets $\beta$ and $\gamma$ induced by it. For every
$v_i \in \beta$, let $T(v_i)$ be the set of nodes $v_{k_j}$ such that there is a sequence $v_{k_1}, v_{k_2}, \ldots, v_{k_m}$ where
$v_i = v_{k_1}$, $v_{k_m} \in \gamma$, and $(v_{k_j}, v'_{k_{j+1}}) \in M$, $1 \le j \le m - 1$. A function $\tau$ is defined on each
element $\Sigma_i \in C$ where $v_i \in \beta$ as follows:

$\tau(\Sigma_i) = \{f^{-1}(v_k) : v_k \in T(v_i)\}$

A subset $\Sigma_i$ corresponding to a base node $v_i$ is referred to as a base subset and $\Sigma_i$ is
said to define the set $\tau(\Sigma_i)$.

Lemma C.1.4 Let $P = \{\tau(\Sigma_i) \mid \Sigma_i = f^{-1}(v_i)$ where $v_i \in \beta\}$; then $P$ is a partition of $C$
and the subsets contained in each element $p \in P$ can be linearly ordered by the relation $R$.

1. Given an arbitrary subset $\Sigma_k \in C$, it is included in one of the (set) elements of $P$.

Proof: If $\Sigma_k$ is a base subset then it is clearly included in $\tau(\Sigma_k) \in P$. If $\Sigma_k$ is
not a base subset then its corresponding node $v'_k$ has a predecessor in $M$, say $v_j$,
corresponding to a subset $\Sigma_j$. Since $R$ is transitive and non-circular then so is $M$.
Therefore, there exists a base subset $\Sigma_i$ such that $(\Sigma_i, \Sigma_k) \in M^+$. Hence, $\Sigma_k$ is
contained in an element of $P$.

2. Given two base subsets $\Sigma_i$ and $\Sigma_k$, if $i \ne k$ then $\tau(\Sigma_i) \cap \tau(\Sigma_k) = \emptyset$.
Proof: Lemmas C.1.1 and C.1.2 and the definition of $\tau$.

3. The total order of the elements of $\tau(\Sigma_i)$ can be determined from the ordering of the
corresponding sequence of $T(v_i)$.

(1), (2) and (3) above imply that $P$ is a partition of $C$ and the subsets contained in
each $p \in P$ can be linearly ordered according to $R$.

Lemma C.1.5 Let $P$ be a partition of $C$ such that the elements of each $p \in P$ can be
linearly ordered; then $P$ corresponds to a matching in $G$.

Proof: Consider each ordered sequence $p = [\Sigma_{k_1}, \Sigma_{k_2}, \ldots, \Sigma_{k_m}]$ in turn. Include the
set of edges $\{(v_{k_j}, v'_{k_{j+1}}) : 1 \le j \le m - 1\}$ in the matching.

Theorem C.1.1 $P$ is a partition of $C$ whose elements can be linearly ordered based on
the relation $R$ if and only if it is induced by a matching on the corresponding graph $G$.

The proof follows from Lemmas C.1.4 and C.1.5.

C.1.2     Weight  Function 

A weight function $W$ is defined on the edges of $G$ as follows:

$W(v_i, v'_j) = |\Sigma_j|$

Applying a maximum weighted bipartite matching algorithm to $G$ yields a matching
$M$ that induces a set of base nodes, $\beta$, such that the total weight of $M$ is:

$W(M) = \sum_{v_j \notin \beta} |\Sigma_j|$

Let $\alpha$ be the total sum of the lengths of the elements of $C$:

$\alpha = \sum_{i=1}^{n} |\Sigma_i|$

Let $\kappa$ be the sum of the lengths of the base subsets. $\kappa$ can be computed given $\alpha$ and
the weight of the matching $M$, as follows:

$\kappa = \alpha - W(M)$

Recall that the elements of the base subsets are concatenated to construct the string
$\omega$. Therefore, since $\alpha$ is constant and the sum of the weights of the non-base subsets is
maximized by the maximum weighted bipartite matching algorithm, it follows that $\kappa$ is
minimized. Hence the final result, which is restated as follows:

Theorem C.1.2 $\min \kappa \iff \min |\omega|$

The proof follows from Theorem C.1.1 and maximum weighted bipartite matching.
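
The relationship between the matching and the base subsets can be tried out on a tiny collection. The sketch below makes several assumptions that are not confirmed by the text: an edge runs from a set to each of its proper subsets and is weighted by the size of the subset; the matching is found by exhaustive search rather than by the max-flow/min-cost algorithm discussed in section C.2; and the helper names and the five-set sample collection are hypothetical.

```python
def build_edges(collection):
    """Edges (i, j, w): collection[i] is a proper superset of collection[j]."""
    return [(i, j, len(sj))
            for i, si in enumerate(collection)
            for j, sj in enumerate(collection)
            if si > sj]


def max_weight_matching(n, edges):
    """Exhaustive search; only suitable for very small collections."""
    succ = {i: [] for i in range(n)}
    for i, j, w in edges:
        succ[i].append((j, w))
    best = (0, [])

    def search(i, used, weight, chosen):
        nonlocal best
        if i == n:
            if weight > best[0]:
                best = (weight, list(chosen))
            return
        search(i + 1, used, weight, chosen)          # leave node i unmatched
        for j, w in succ[i]:
            if j not in used:
                used.add(j)
                chosen.append((i, j))
                search(i + 1, used, weight + w, chosen)
                chosen.pop()
                used.remove(j)

    search(0, set(), 0, [])
    return best


def chains(n, matching):
    """Follow matched edges from each base node (a node with no predecessor)."""
    succ = dict(matching)
    has_pred = {j for _, j in matching}
    result = []
    for base in range(n):
        if base in has_pred:
            continue
        chain = [base]
        while chain[-1] in succ:
            chain.append(succ[chain[-1]])
        result.append(chain)
    return result


collection = [frozenset("abc"), frozenset("ab"), frozenset("a"),
              frozenset("de"), frozenset("d")]
weight, matching = max_weight_matching(len(collection), build_edges(collection))
alpha = sum(len(s) for s in collection)
print(weight, chains(len(collection), matching), alpha - weight)
```

For this sample the two chains start at the sets of sizes 3 and 2, and the difference between the total size and the matching weight equals the combined size of those two base subsets.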
C.2 Complexity

Assume that one can test whether or not a given set is a proper subset of another set in
unit time. This is a reasonable assumption if the size of $\Sigma$ is not too large and bit-strings
are used to represent the subsets. With that assumption, the construction of the graph $G$
from the collection $C$ requires $O(n^2)$ operations.

Tarjan [7] showed that the maximum weighted bipartite matching problem can be
reduced to the max-flow/min-cost problem with no negative cost cycles. He also presented
an algorithm for finding a minimum cost maximum flow $f^*$ in a network with $n$ nodes and
$m$ edges in $O(m |f^*| \log_{(2+m/n)} n)$ time. In this case, the cost of constructing the graph is
dominated by the max-flow/min-cost algorithm and $|f^*| = n/2$. Hence, the total running
time is:

C.3      Experimental  Results  with  OP 

Experimental results with the t_symbols maps of the six programming language grammars
mentioned in Appendix A are presented in Figure C.1.

Figure C.1: Experimental results

In the column headed t_symbols, two numbers are specified for each grammar. The first
number indicates the sum of the sizes of the sets in the range of t_symbols and the second
number indicates the number of elements in the map. Given a t_symbols map, before OP
is applied to it, the sets in its range that are identical are merged. In the column headed
"merged", the first number specified for each element indicates the sum of the sizes of
the sets in the merged map and the second number indicates the number of merged sets.
The column headed "Heuristic" shows the length of the resulting string and the number of
bases in the partition obtained using the heuristic algorithm described in section 5.2.3.3.
The column headed "OP" shows the length of the resulting string and the number of bases
obtained using OP.
Appendix D

Statistics 


The table of Figure D.1 shows some statistics about the six grammars described earlier
and the LALR(k) parsers constructed for them with this method. The information in the
table is self-explanatory. In specifying storage values, an 8-bit byte machine is assumed. That is,
integer values less than 255 are assumed to be stored in a single byte; integer values that
are greater than 255 are assumed to be stored in two bytes.
                                         Pascal  Pascal1  Pascal2      C    Ada   Sedl
Terminals                                    63       63       63     86     96    124
Nonterminals                                110      111      111     98    304    362
Productions                                 213      216      215    266    523    746
Storage for Rules                           424      428      428    530   1566   2235
Items                                       626      625      627    786   1527   2398
Scopes                                       13       13       13     30     25     93
States                                      193      189      191    206    516    824
Look-ahead states                             0        1        5      0      0      0
Shift actions                               396      393      393    349    593   1987
Goto actions                                336      332      334    975    884   2250
Shift/Reduce actions                        329      324      326    987    700   5755
Goto/Reduce actions                         574      569      569    322   1779   4360
Reduce actions                              326      320      390   1262   1554   4113
Goto entries removed by default             245      244      245    915    705   1763
Goto/Reduce entries removed by default      484      481      481    270   1592   3958
Non-terminals eliminated by default          60       62       61     52    216    206
Length of GOTO_CHECK Table                  485      477      480    417   1187   2076
Length of GOTO_INFO Table                   374      365      368    318    882   1713
Useful entries in GOTO_INFO                 374      365      368    318    882   2076
Storage for compressed GOTO                1455     1431     1440   1251   3561   1713
Terminal states after merging               130      128      133    159    356    513
Shift actions removed by merging            456      450      451    541    679   5484
Reduce actions removed by merging            30       47       48    109    579   1450
Reductions removed by default               263      261      326   1143    908   2544
Length of ACTION_CHECK Table                475      471      490   1076   1133   3014
Length of ACTION_INFO Table                 428      420      436   1049   1037   2951
Useful entries in ACTION_INFO               412      420      417    964   1037   2890
Storage for compressed ACTION              1331      407     1362   3174   3207   8916
Storage for Parsing Tables                 2786     2742     2802   4425   6768  15144

Actions in Compressed Tables:
Shifts + Shift/Reduces + LA Shifts          269      268      273    795    614   2258
Gotos + Goto/Reduces                        181      176      177    112    366    889
Reduces                                      55       53       62     90    212    343

Storage for Error maps:
t_symbols map                               399      392      394    928   1651   3188
nt_symbols map                              328      323      327    346   1636   2578
SCOPE map                                   282      280      281    442    780   2843
Required storage for Error maps            1009      995     1002   1716   4067   8609
Unoptimized NAME map                       2721     2728     2728   2786   8899   9491
Optimized NAME map                         1832     1821     1821   2477   3839   6365

Total Storage Requirement for Parsing and Error Recovery Tables
(does not include NAME map)                4704     4642     4712   7088  13588  28064

Figure D.1: Parser Statistics
Appendix E

Error  Recovery  Examples 


Erroneous Pascal and Ada programs were processed with the error recovery method
described in this thesis. The first Pascal program was designed to demonstrate the
effectiveness of the scope recovery. The second program was constructed from some interesting
examples drawn from [20]. The Ada example is taken from [37]. The Pascal2 grammar
of Appendix F and the Ada grammar of [31] were used exactly as is; no user-supplied
scopes were used. Thus, the results presented here were obtained automatically from the
specification of the input grammars with no intervention whatsoever from the user.

Pascal2  Example  1 

 1.  program test(input,output);
 2.  var i,j,k : integer;
 3.      l,m,n : integer;
 4.      x,y,z : real;
 5.      final : real;
 6.      a,b,c : array [1..10] of real;
 7.  procedure sub(i,j,k:integer; x,y,z:real);

 8.      var v,w : integer;
 9.      begin
10.          a := (( 3 ]];
<>
***Error: ")" inserted to complete phrase
***Error: ")" inserted to complete phrase
***Error: Unexpected input discarded
11.          v := i + j + k;
12.          w := x + y + z;
***Error: "END" inserted to complete phrase started at line 9, column 5
13.
14.  function f(x,y:real) : result real;
***Error: Unexpected symbol ignored
15.      var z : real;
16.      begin
17.          z := x + y
***Error: "END" inserted to complete phrase started at line 16, column 5
***Error: ";" inserted to complete phrase started at line 7, column 1
18. 

19.  begin
20.      i := 1;
21.      while i <= 10 do
22.          begin
23.              j := 1;
24.              while j <= 10 do
25.                  begin
26.                      a[i] := a[i] + b[j] + c[j];
27.                      j := j + 1
28.                  end;
29.              i := i + 1
***Error: ; expected after this token
30.      final := f(x,y,z)*2.0/i+j+345.9;
31.      while (x = y) do begin
32.          if y = z then begin
33.              sub;
34.              x := ((y + z
***Error: ")" inserted to complete phrase
***Error: ")" inserted to complete phrase
***Error: "END" inserted to complete phrase started at line 32, column 23
***Error: "ELSE" inserted to complete phrase started at line 32, column 9
***Error: "END" inserted to complete phrase started at line 31, column 22
***Error: "END" inserted to complete phrase started at line 22, column 9
***Error: "END" inserted to complete phrase started at line 19, column 1
***Error: . expected after this token




Pascal2  Example  2 


1.  PROGRAM P(INPUT,OUTPUT);
2.  BEGIN
3.  FOR K1 := TO NOELEMS DO X:=1
***Error: initial_value expected after this token
***Error: ; expected after this token
4.  IF X=1 THEN

5.  BEGIN 

6.  WRITELN('0»EKD  OF  SORT*') 

***Error:  "END"  inserted  to  complete  phrase  started  at  line  6,  column  8 

7 .  ELSE 

8.  WRITELN('0**»LOOP  DETECTED  IN  INPUT  ORDER  RELATIONS***  '  )  ; 

9.  END. 
10. 

11.  PROGRAM  HUNTERMNPUT. OUTPUT'? 

< > 

***Error: ( expected instead of this token
***Error: program_heading expected instead

12.  VAR  Q: INTEGER; 

13.  BEGIN 

14.  END. 
15. 

16.  PROGRAM  P( INPUT, OUTPUT); 

17.  VAR  I,PRIHE.CHECK,NUMB:REAL,A:ARRAY\1. .6'  OF  REAL' 


***Error: ; expected instead of this token
***Error: [ expected instead of this token
***Error: ] expected instead of this token
***Error: ; expected instead of this token

18.  BEGIN 

19.  END. 
20. 

21.  PROGRAM  P( INPUT. OUTPUT); 

22.  BEGIN 

23.  FOR  I  1  TO  6  DO  X:=l 

***Error: := expected after this token

24.  END. 
26. 

26.  PROGRAM  P ( INPUT , OUTPUT ) ; 

27.  BEGIN 

28.  CHECK:  1? 




< > 

***Error: Unexpected input discarded

29.  BEGIN 

30.  WHILE  CHECK)'  PRIME  DO  X:=l 

< > 

***Error: Unexpected input discarded

31.  ESD 

32.  EFD. 
33. 

34.  PROGRAM  P (IHPUT. OUTPUT ) ; 

35.  VAR  I:REAL; 

36.  COHST  A[l]  10;A[2]  15;A[3]  25;A[4]  3;?A[S]  50;A[6]  75; 


***Error: BEGIN expected instead of this token
***Error: := expected after this token
***Error: := expected after this token
***Error: := expected after this token
***Error: := expected after this token
***Error: Unexpected symbol ignored
***Error: := expected after this token
***Error: := expected after this token

37.  BEGIN 

38.  END. 

***Error: "END" inserted to complete phrase started at line 36, column 5

39. 

40.  PROGRAM  P (INPUT, OUTPUT) ; 

41.  BEGIN 

42.  IF  X<=0  THEN  FACT  :=  1  ENSE  FACT  :=  X^FACT(X-l) 

***Error: ELSE expected instead of this token
43.  END;
***Error: Unexpected symbol ignored
44.  BEGIN
45.  END.
***Error: "END" inserted to complete phrase started at line 41, column 6

46. 

47.  PROGRAM  REID (IHPUT, OUTPUT) ; 

48.  CONST  A[1]=10;A[2]=15;A[3]=25;A[4]==35;A[5]=50;A[6]=75: 
<
***Error: BEGIN expected instead of this token
***Error: = expected instead of this token
***Error: = expected instead of this token
***Error: = expected instead of this token
***Error: = expected instead of this token
***Error: = expected instead of this token
***Error: Misplaced construct(s)

49.      VAR Q: INTEGER;
51.  END.
52.
53.  PROGRAM P(INPUT,OUTPUT);
54.      PROCEDURE PR;
55.      BEGIN
56.      X:=1;
57.      BEGIN
58.      X:=1
59.      END :
***Error: Misplaced construct(s) (from line 54, column 5 to ...)
60.      VAR FACT, FACT2: REAL ;
61.  BEGIN
62.  END.

63. 

64.  PROGRAM P(INPUT,OUTPUT);
65.  BEGIN;
<-- -->
***Error: procedure_or_function_declaration_list expected instead
66.      PROCEDURE FACTR(N: INTEGER ; VAR FACTOR : INTEGER ) ;
67.      BEGIN
68.      X:=1
69.      END ;
***Error: BEGIN expected after this token
70.  X:=1
71.  END.
Ada

1.  procedure ATESTS IS
2.  x: float := 2.1 +;
***Error: term expected after this token
3.  a: array INTEGER range 1..10 of integer;
***Error: ( expected after this token
***Error: ) expected after this token
4.  B: ARRAY [INTEGER RANGE 0..9] OF FLOAT;
***Error: ( expected instead of this token
***Error: ) expected instead of this token

5.  C  :  ARRAY  (BOOLEAN); 

< > 

***Error: [CONSTANT]constrained_array_definition expected instead

6.  type  t  is 

7.  RECORD 

8.  A  B:  CHARACTER; ; 


***Error: Unexpected symbol ignored
***Error: Unexpected symbol ignored
9.  END RECORD;
10.  type b IS INTEGER range 1..30;
***Error: Unexpected symbol ignored
11.  subtype c is range 1..30;
***Error: type_mark expected after this token
12. 

13.  procedure  count  is 

14.  use  TEXT_IO; 

15.  x:  integer; 

***Error: BEGIN expected after this token

16.  GET(x); 

17.  PUT(x);  a  bad  comment 

< > 

***Error: Unexpected input discarded

18.  end  count; 
19. 

20.  procedure  q  is  seperate; 

***Error: SEPARATE expected instead of this token
21. 

22.  procedure  spell  is 

23.  B:  ARRAY  OF  FLOAT; 
***Error: index_constraint expected after this token
24.  x: integer;
25.  FOR I IN 1 .. 10 LOOP
***Error: BEGIN inserted before this token
26.  b(i) := 0.0;
***Error: "END LOOP ;" inserted to complete phrase started at line 25, column 7


27.  end;
28.
29.  function DAYS_IS_MONTH(M:
***Error: ; expected after this token
30.  begin
31.  case M of
***Error: ( expected instead of this token

32.  FEB  =>  RETURN  28; 
> 

***Error: Misplaced construct(s) (from line 31, column 9 to ...)
***Error: EXCEPTION expected after this token
33.  when APR => 30;
***Error: RETURN expected after this token
34.  when SEP | APR | JUN | | NOV => RETURN 30;
***Error: Unexpected symbol ignored
35.  when others => return 31;

36.  end  case; 
< > 

***Error: Unexpected input discarded
37.  2(Y - S^J + K REM 7) > 0 LOOP
< >
***Error: Unexpected input discarded
38.  X := X + 1;
39.  go to label;
< >
***Error: Symbols merged to form GOTO
40.  END LOOP;
41.  X := X + + 1;
***Error: Unexpected symbol ignored
42.  y := (( 3 ;
***Error: ")" inserted to complete phrase
***Error: ")" inserted to complete phrase




43.  return 28;
44.  <<label
***Error: >> expected after this token
45.  return 29;
46.  end of DAYS_IS_MONTH;
***Error: Unexpected symbol ignored
47. 

48.  PROCEDURE  P  IS 

49.  x:  integer  :=  2 

***Error: ; expected after this token

50.  begin 

51.  loop 

52.  if  x  >  0  then  y  :=  2; 

53.  if  y  <  0  then  z  :=  3; 


***Error: "END IF ;" inserted to complete phrase
***Error: "END IF ;" inserted to complete phrase started at line 52, column 13

54.  end  loop; 

55.  end  p; 
56. 

57.  procedure  test  is 

58.  x: array (123. 144) of real;
<
***Error: ".. simple_expression" inserted to complete phrase
59.  Y: INTEGER = 5;
***Error: := expected instead of this token
60.  begin
61.  if x(1) := y then
***Error: Invalid relational_operator
62.  NULL;
63.  ELSEIF X(2) > Y THEN
***Error: ELSIF expected instead of this token

64.  NULL; 

65.  end  if; 

66.  end  atests; 

***Error: "BEGIN sequence_of_statements END ;" inserted to complete
phrase started at line 1, column 1
Appendix F

Pascal2  Grammar 


This grammar is an LALR(2) grammar for Pascal derived from the grammar in Appendix
D of [11]. It is specified in standard BNF format, with adjacent symbols separated by at
least one blank. The "%empty" symbol denotes the empty string. All reserved words are
written in upper-case letters. The lexical categories integer_literal, string_literal, etc.,
are viewed as terminals. The "--" symbol indicates that the rest of the line is a
comment. Comments are included wherever there is a deviation from the syntax given in
Appendix D.
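
The conventions just described (blank-separated symbols, "%empty" for the empty string, "--" introducing a comment) are enough to read the rules mechanically. The following Python sketch is only an illustration of those conventions, not the reader used by the parser generator; the function name parse_rules and the dictionary representation are assumptions made for the example. It also assumes that a line without "::=" continues the rule on the preceding lines.

```python
def parse_rules(text):
    """Parse blank-separated BNF text into {lhs: [alternative, ...]}.

    Each alternative is a list of symbols; %empty yields an empty list.
    """
    rules = {}
    lhs = None
    for line in text.splitlines():
        tokens = line.split()
        if "--" in tokens:
            # Everything from "--" to the end of the line is a comment.
            tokens = tokens[:tokens.index("--")]
        if not tokens:
            continue
        if len(tokens) >= 2 and tokens[1] == "::=":
            # A new rule starts: open its first alternative.
            lhs = tokens[0]
            rules.setdefault(lhs, []).append([])
            tokens = tokens[2:]
        elif lhs is None:
            raise ValueError("right-hand side before any rule: " + line)
        for tok in tokens:
            if tok == "|":
                rules[lhs].append([])        # next alternative
            elif tok != "%empty":
                rules[lhs][-1].append(tok)   # ordinary grammar symbol
    return rules


sample = """
sign ::= + | -
label_declaration_part ::= %empty | LABEL label_list ;
variable ::= IDENTIFIER | variable [ expression_list ]
    | record_variable . IDENTIFIER
    | variable ^ -- file or pointer variable
"""
for name, alternatives in parse_rules(sample).items():
    print(name, alternatives)
```

Representing an empty right-hand side as an empty list keeps "%empty" out of the symbol vocabulary, which simplifies later analyses such as finding nullable nonterminals.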

The information associated with this grammar is broken down into four sections. The
first is a "Terminals" section, in which the terminals of the grammar are listed.
The second is a "Rules" section, where the rules of the grammar are specified.
The third is a "Names" section, where symbols of the grammar are mapped to
different names to be used in issuing error diagnostics. The fourth and final section is a
"Scopes" section, where the scopes that were computed automatically from the grammar
are given. Each scope is written in the form of a BNF rule with a "." in its right-hand
side to indicate where the prefix of the scope ends and the suffix begins. Note that a scope
does not necessarily correspond to a rule of the grammar. In particular, if the suffix of a
scope contains nullable nonterminals, these nullable symbols are removed from the
scope suffix.
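
The removal of nullable symbols from a scope suffix can be illustrated with a small Python sketch. It is not the scope computation used by the parser generator: the position of the "." is taken as given, the helper names nullable_nonterminals and format_scope are assumptions, and RULES is a hand-picked fragment of the grammar with several alternatives omitted (omitting alternatives cannot make a symbol nullable that is not nullable in the full grammar).

```python
def nullable_nonterminals(rules):
    """Return the set of nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for lhs, alternatives in rules.items():
            if lhs in nullable:
                continue
            # An empty alternative ([]) satisfies all(...) immediately.
            if any(all(sym in nullable for sym in alt) for alt in alternatives):
                nullable.add(lhs)
                changed = True
    return nullable


def format_scope(lhs, rhs, dot, nullable):
    """Write a scope as 'lhs ::= prefix .suffix' without nullable suffix symbols."""
    prefix = rhs[:dot]
    suffix = [sym for sym in rhs[dot:] if sym not in nullable]
    return "{} ::= {} .{}".format(lhs, " ".join(prefix), " ".join(suffix))


# Fragment of the grammar (assumed subset; [] stands for %empty).
RULES = {
    "statement": [["unlabelled_statement"], ["label", ":", "unlabelled_statement"]],
    "unlabelled_statement": [["simple_statement"]],
    "simple_statement": [["empty_statement"], ["goto_statement"]],
    "empty_statement": [[]],
    "goto_statement": [["GOTO", "label"]],
    "label": [["integer_literal"]],
    "[;]": [[], [";"]],
}

nullable = nullable_nonterminals(RULES)
if_rule = ["IF", "expression", "THEN", "restricted_statement", "[;]", "ELSE", "statement"]
print(format_scope("if_statement", if_rule, 5, nullable))
```

Because statement can derive the empty string through empty_statement, it is dropped from the suffix, which leaves only ELSE after the "."; this agrees with the if_statement entry in the Scopes section.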

Terminals 

PROGRAM BEGIN END FUNCTION PROCEDURE VAR TYPE CONST LABEL GOTO WITH
DO FOR TO DOWNTO REPEAT UNTIL WHILE CASE OF IF THEN ELSE RECORD SET
FILE_tok ARRAY PACKED OR IN DIV MOD AND NOT NIL DIRECTIVE
IDENTIFIER STRING_LITERAL INTEGER_LITERAL REAL_LITERAL
+ - * / = <> < <= > >= . .. , : ; := ( ) [ ] ^

Rules 

program_list ::= %empty | program_list program
program ::= program_heading block .
program_heading ::= PROGRAM IDENTIFIER ( file_identifier_list ) ;
file_identifier ::= IDENTIFIER
block ::= label_declaration_part constant_definition_part
          type_definition_part variable_declaration_part
          procedure_and_function_declaration_part
          statement_part
label_declaration_part ::= %empty | LABEL label_list ;
label ::= integer_literal
constant_definition_part ::= %empty | CONST constant_definition_list
constant_definition ::= IDENTIFIER = constant
constant ::= unsigned_number | sign unsigned_number
    | constant_identifier | sign constant_identifier
    | string_literal
unsigned_number ::= integer_literal | real_literal
sign ::= + | -
constant_identifier ::= IDENTIFIER

type_definition_part ::= %empty | TYPE type_definition_list ;
type_definition ::= IDENTIFIER = type
type ::= simple_type | structured_type | pointer_type
simple_type ::= scalar_type | subrange_type | type_identifier
scalar_type ::= ( identifier_list )
subrange_type ::= constant .. constant
type_identifier ::= IDENTIFIER
structured_type ::= unpacked_structured_type
    | PACKED unpacked_structured_type
unpacked_structured_type ::= array_type | record_type
    | set_type | file_type
array_type ::= ARRAY [ index_type_list ] OF component_type
index_type ::= simple_type
component_type ::= type
record_type ::= RECORD field_list END
field_list ::= fixed_part | variant_part | fixed_part ; variant_part
fixed_part ::= record_section | fixed_part ; record_section
record_section ::= %empty | field_identifier_list : type
variant_part ::= CASE tag_field type_identifier OF
                     variant_list
tag_field ::= %empty | field_identifier :
field_identifier ::= IDENTIFIER
variant ::= %empty | case_label_list : ( field_list )
case_label ::= constant


set_type ::= SET OF base_type
base_type ::= simple_type
file_type ::= FILE OF type
pointer_type ::= ^ type_identifier

variable_declaration_part ::= %empty | VAR variable_declaration_list
variable_declaration ::= identifier_list : type
procedure_and_function_declaration_part ::= %empty
    | procedure_or_function_declaration_list ;
procedure_or_function_declaration ::= procedure_declaration
    | function_declaration
procedure_declaration ::= procedure_heading block
    | procedure_heading DIRECTIVE
procedure_heading ::= PROCEDURE IDENTIFIER ;
    | PROCEDURE IDENTIFIER
          ( formal_parameter_section_list ) ;
formal_parameter_section ::= parameter_group | VAR parameter_group
    | FUNCTION parameter_group
    | PROCEDURE identifier_list
parameter_group ::= identifier_list : type_identifier
function_declaration ::= function_heading block
    | function_heading DIRECTIVE
function_heading ::= FUNCTION IDENTIFIER : result_type ;
    | FUNCTION IDENTIFIER
          ( formal_parameter_section_list ) : result_type ;
result_type ::= type_identifier
statement_part ::= compound_statement

statement ::= unlabelled_statement | label : unlabelled_statement
unlabelled_statement ::= simple_statement | structured_statement
simple_statement ::= assignment_statement | procedure_statement
    | goto_statement | empty_statement
assignment_statement ::= variable := expression

-- The variable on the left-hand side of the assignment may be a
-- file_identifier, a referenced_variable, a pointer variable,
-- a simple identifier, an indexed variable, a field_designator,
-- a file_buffer or a function identifier.

record_variable ::= variable
variable ::= IDENTIFIER | variable [ expression_list ]
    | record_variable . IDENTIFIER
    | variable ^ -- file or pointer variable

expression ::= simple_expression
    | simple_expression relational_operator simple_expression
relational_operator ::= = | <> | < | <= | >= | > | IN
simple_expression ::= term | sign term
    | simple_expression adding_operator term
adding_operator ::= + | - | OR
term ::= factor | term multiplying_operator factor
multiplying_operator ::= * | / | DIV | MOD | AND
factor ::= variable | unsigned_constant
    | ( expression ) | function_designator
    | set | NOT factor
unsigned_constant ::= unsigned_number | string_literal | NIL
function_designator ::= function_identifier ( actual_parameter_list )
function_identifier ::= IDENTIFIER
set ::= [ ] | [ element_list ]
element ::= expression | expression .. expression
procedure_statement ::= procedure_identifier
    | procedure_identifier ( actual_parameter_list )
procedure_identifier ::= IDENTIFIER
actual_parameter ::= expression

-- An actual parameter may be a simple variable, a procedure
-- identifier or a function identifier. These classes are
-- included under expression.
goto_statement ::= GOTO label
empty_statement ::= %empty
structured_statement ::= compound_statement | conditional_statement
    | repetitive_statement | with_statement
restricted_statement ::= simple_statement
    | compound_statement
    | IF expression THEN
          restricted_statement [;]
      ELSE restricted_statement
    | case_statement
    | restricted_while_statement
    | repeat_statement
    | restricted_for_statement
    | restricted_with_statement
compound_statement ::= BEGIN statement_list END
conditional_statement ::= if_statement | case_statement
if_statement ::= IF expression THEN
                     statement
    | IF expression THEN
          restricted_statement [;]
      ELSE statement

-- Note that the use of [;] in the specification of an if_statement
-- is an extension that allows the parser to accept such a statement
-- with a ";" preceding the "ELSE". With this arrangement, a
-- semantic warning message is emitted when the ";" is present.

case_statement ::= CASE expression OF
                       case_list_element_list
                   END
case_list_element ::= %empty | case_label_list : statement
repetitive_statement ::= while_statement
    | repeat_statement
    | for_statement
while_statement ::= WHILE expression DO
                        statement
restricted_while_statement ::= WHILE expression DO
                                   restricted_statement
repeat_statement ::= REPEAT
                         statement_list
                     UNTIL expression
for_statement ::= FOR control_variable := for_list DO
                      statement
restricted_for_statement ::= FOR control_variable := for_list DO
                                 restricted_statement
for_list ::= initial_value TO final_value
    | initial_value DOWNTO final_value
control_variable ::= IDENTIFIER
initial_value ::= expression
final_value ::= expression
with_statement ::= WITH record_variable_list DO
                       statement
restricted_with_statement ::= WITH record_variable_list DO
                                  restricted_statement

-- Expansion of lists

index_type_list ::= index_type | index_type_list , index_type
expression_list ::= expression | expression_list , expression
identifier_list ::= identifier | identifier_list , identifier
field_identifier_list ::= field_identifier
    | field_identifier_list , field_identifier
file_identifier_list ::= file_identifier
    | file_identifier_list , file_identifier
label_list ::= label | label_list , label
constant_definition_list ::= constant_definition
    | constant_definition_list ; constant_definition
type_definition_list ::= type_definition
    | type_definition_list ; type_definition
procedure_or_function_declaration_list ::= procedure_or_function_declaration
    | procedure_or_function_declaration_list
      procedure_or_function_declaration
variant_list ::= variant | variant_list ; variant
case_label_list ::= case_label | case_label_list , case_label
variable_declaration_list ::= variable_declaration
    | variable_declaration_list ; variable_declaration
formal_parameter_section_list ::= formal_parameter_section
    | formal_parameter_section_list ; formal_parameter_section
actual_parameter_list ::= actual_parameter
    | actual_parameter_list , actual_parameter
element_list ::= element | element_list , element
statement_list ::= statement | statement_list ; statement
case_list_element_list ::= case_list_element
    | case_list_element_list ; case_list_element
record_variable_list ::= record_variable
    | record_variable_list , record_variable

-- Expand the optional semicolon nonterminal that may appear before ELSE

[;] ::= %empty
    | ; -- Issue a warning when this rule is reduced

Names 

EOFT_SYMBOL -> 'End of Pascal source'

Scopes 

if_statement ::= IF expression THEN restricted_statement [;] .ELSE
restricted_statement ::= IF expression THEN restricted_statement [;] .ELSE
block ::= label_declaration_part constant_definition_part
          type_definition_part variable_declaration_part
          procedure_and_function_declaration_part .statement_part
case_statement ::= CASE expression OF case_list_element_list .END
variant ::= case_label_list COLON ( field_list .)
function_designator ::= function_identifier ( actual_parameter_list .)
variable ::= variable [ expression_list .]
repeat_statement ::= REPEAT statement_list .UNTIL expression
compound_statement ::= BEGIN statement_list .END
set ::= [ element_list .]
factor ::= ( expression .)
record_type ::= RECORD field_list .END
procedure_and_function_declaration_part ::=
          procedure_or_function_declaration_list . ;